Abstract



Stability is a general notion that quantifies the sensitivity of a learning algorithm's output to small
changes in the training dataset (e.g. deletion or replacement of a single training sample). Such
conditions have recently been shown to be more powerful for characterizing learnability in the general
learning setting under i.i.d. samples, where uniform convergence is not necessary for learnability
but where stability is both sufficient and necessary. We show here that similar stability conditions
are also sufficient for online learnability, i.e. for the existence of a learning algorithm that, under
any sequence of examples (potentially chosen adversarially), produces a sequence of hypotheses that
has no regret in the limit with respect to the best hypothesis in hindsight. We introduce online
stability, a stability condition related to uniform-leave-one-out stability in the batch setting, that
is sufficient for online learnability. In particular, we show that popular classes of online learners,
namely algorithms that fall in the category of Follow-the-(Regularized)-Leader, Mirror Descent,
gradient-based methods and randomized algorithms like Weighted Majority and Hedge, are guaranteed to
have no regret if they have this online stability property. We provide examples suggesting that the
existence of an algorithm with such a stability condition might in fact be necessary for online
learnability. For the more restricted binary classification setting, we establish that such a stability
condition is in fact both sufficient and necessary. We also show that for a large class of online
learnable problems in the general learning setting, namely those with a notion of sub-exponential
covering, no-regret online algorithms that have such a stability condition exist.

1 Introduction 

We consider the problem of online learning in a setting similar to the General Setting of Learning (Vapnik).
In this setting, an online learning algorithm observes data points $z_1, z_2, \dots, z_m \in \mathcal{Z}$ in sequence,
potentially chosen adversarially, and upon seeing $z_1, z_2, \dots, z_{i-1}$, the algorithm must pick a hypothesis
$h_i \in \mathcal{H}$ that incurs loss on the next data point $z_i$. Given the known loss functional
$f : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$, the regret $R_m$ of the sequence of hypotheses $h_{1:m}$ after observing
$m$ data points is defined as:

$$R_m = \sum_{i=1}^m f(h_i, z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^m f(h, z_i) \tag{1}$$

The goal is to pick a sequence of hypotheses $h_{1:m}$ that has no regret, i.e. the average regret
$R_m / m \to 0$ as the number of data points $m \to \infty$.

The setting we consider is general enough to subsume most, if not all, online learning problems. In fact
the space $\mathcal{Z}$ of possible "data points" could itself be a function space $\mathcal{H} \to \mathbb{R}$, such that
$f(h, z) = z(h)$. Hence the typical online learning setting where the adversary picks a loss function
$\mathcal{H} \to \mathbb{R}$ at each time step is always subsumed by our setting. The data points $z$ should more loosely be
interpreted as the parameters that define the loss function at the current time step. For instance, in a
supervised classification scenario, the space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, for $\mathcal{X}$ the input features and
$\mathcal{Y}$ the output class, and the classification loss is defined as $f(h, (x, y)) = I(h(x) \ne y)$ for $I$ the
indicator function. We do not make any assumption about $f$, other than that the maximum instantaneous
regret is bounded: $\sup_{z \in \mathcal{Z},\, h, h' \in \mathcal{H}} |f(h, z) - f(h', z)| \le B$. This allows for a
potentially unbounded loss $f$: e.g., consider $z \in \mathbb{R}$, $h \in [-k, k]$ and $f(h, z) = |h - z|$; then the
immediate loss is unbounded but the instantaneous regret is bounded by $B = 2k$.
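The bounded-regret example can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the grid resolution and the sample values of z are illustrative assumptions. It evaluates the absolute loss on a grid over [-k, k] and shows that the loss itself can be large while the largest gap between two hypotheses stays at 2k.

```python
def abs_loss(h, z):
    """Absolute loss f(h, z) = |h - z| from the example in the text."""
    return abs(h - z)


def max_instantaneous_regret(k, zs, grid_size=51):
    """Largest |f(h, z) - f(h', z)| over a grid of h, h' in [-k, k] and the given z values."""
    hs = [-k + 2.0 * k * j / (grid_size - 1) for j in range(grid_size)]
    return max(abs(abs_loss(h, z) - abs_loss(h2, z))
               for z in zs for h in hs for h2 in hs)


k = 1.0
zs = [-100.0, 0.0, 100.0]                        # illustrative data points, far outside [-k, k]
largest_loss = max(abs_loss(k, z) for z in zs)   # grows with |z|
B = max_instantaneous_regret(k, zs)              # stays at 2k
```

Here the loss reaches 101 while the instantaneous regret never exceeds 2, which is why the analysis only needs the bound B and not a bound on the loss.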

We are interested in characterizing sufficient conditions under which an online algorithm is guaranteed
to pick a sequence of hypotheses that has no regret under any sequence of data points an adversary
might pick. In the batch setting, when the data points are drawn i.i.d. from some unknown distribution $\mathcal{D}$,
Shalev-Shwartz et al. (2010, 2009) have shown that stability is a key property for learnability. In particular,
they show that a problem is learnable if and only if there exists a universally stable asymptotic empirical risk
minimizer (AERM).

In this paper, we consider using batch algorithms in our online setting, where the hypothesis $h_i$ is the
output of the batch learning algorithm on the first $i - 1$ data points. Many online algorithms (such as
Follow-the-(Regularized)-Leader, Mirror Descent, Weighted Majority, Hedge, etc.) can be interpreted in this way.
For instance, Follow-the-Leader (FTL) algorithms can essentially be thought of as using a batch empirical risk
minimizer (ERM) algorithm to select the hypothesis $h_i$ on the dataset $\{z_1, z_2, \dots, z_{i-1}\}$, while
Follow-the-Regularized-Leader (FTRL) algorithms essentially use a batch AERM algorithm (more precisely, what
we call a Regularized ERM (RERM)) to select the hypothesis $h_i$ on the dataset $\{z_1, z_2, \dots, z_{i-1}\}$. Our
main result shows that Uniform Leave-One-Out stability (Shalev-Shwartz et al., 2009), albeit stronger than
the stability condition required in (Shalev-Shwartz et al., 2010, 2009), is in fact sufficient to guarantee no
regret of RERM type algorithms. For asymmetric algorithms like gradient-based methods (which can also
be seen as some form of RERM), a notion related to Uniform Leave-One-Out stability (and equivalent to
it for symmetric algorithms), which we call online stability, is also sufficient to guarantee no regret. We also
provide general results for the class of always-AERM algorithms (a slightly stronger notion than AERM but
weaker than ERM and RERM). Unfortunately these results are weaker in that they require the algorithm to be
stable or an always-AERM at a fast enough rate.

The stronger notion of stability we use to guarantee no regret seems to be necessary in the online setting. 
Intuitively, this is because the algorithm must be able to compete on any sequence of data points, potentially 
chosen adversarially, rather than on i.i.d. sampled data points. We also provide an example that illustrates
this: an AERM with a slightly weaker stability condition can learn the problem in the batch setting
but not in the online setting, whereas there is an FTRL algorithm that can learn the problem in the online
setting. Furthermore, it is known that batch learnability and online learnability are not equivalent, which
naturally suggests stronger notions of stability should be necessary for online learnability. We review a known 
problem of threshold learning over an interval that shows batch and online learnability are not equivalent. In 
the more restricted binary classification setting, we show that existence of a (potentially randomized) uniform- 
LOO stable RERM is both sufficient and necessary for online learnability. We also show that for a large class 
of online learnable problems in the general learning setting, namely those with a notion of sub-exponential 
covering, uniform-LOO stable (potentially randomized) RERM algorithms exist. 

We begin by introducing notation, definitions and reviewing stability notions that have been used in the 
batch setting. We then provide our main results which show how some of these stability notions can be used 
to guarantee no regret in the online setting. We then go over examples that suggest such a strong stability notion
might in fact be necessary in the online setting. We further show that in the restricted binary classification 
setting, such stability notions are in fact necessary. We also introduce a notion of covering that allows us to 
show that uniform-LOO stable RERM algorithms exist for a large class of online learnable problems in the 
general learning setting. We conclude with potential future directions and open questions. 

2 Learnability and Stability in the Batch Setting 

In the batch setting, a batch algorithm is given a set of $m$ i.i.d. samples $z_1, z_2, \dots, z_m$ drawn from some
unknown distribution $\mathcal{D}$, and given knowledge of the loss functional $f$, we seek to find a hypothesis
$h \in \mathcal{H}$ that minimizes the population risk:

$$F(h) = \mathbb{E}_{z \sim \mathcal{D}}[f(h, z)] \tag{2}$$

Given a set of $m$ i.i.d. samples $S \sim \mathcal{D}^m$, the empirical risk of a hypothesis $h$ is defined as:

$$F_S(h) = \frac{1}{m} \sum_{i=1}^m f(h, z_i) \tag{3}$$

Most batch algorithms used in practice proceed by minimizing the empirical risk, at least asymptotically
(when an additional regularizer is used).

Definition 1 An algorithm $A$ is an Empirical Risk Minimizer (ERM) if for any dataset $S$:

$$F_S(A(S)) = \min_{h \in \mathcal{H}} F_S(h) \tag{4}$$

Definition 2 (Shalev-Shwartz et al., 2010) An algorithm $A$ is an Asymptotic Empirical Risk Minimizer
(AERM) under distribution $\mathcal{D}$ at rate $\epsilon_{\text{erm}}(m)$ if for all $m$:

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[F_S(A(S)) - \min_{h \in \mathcal{H}} F_S(h)\right] \le \epsilon_{\text{erm}}(m) \tag{5}$$

Whenever we mention a rate $\epsilon(m)$, we mean that $\{\epsilon(m)\}_{m=0}^{\infty}$ is a monotonically non-increasing sequence that is
$o(1)$, i.e. $\epsilon(m) \to 0$ as $m \to \infty$. If $A$ is an AERM under any distribution $\mathcal{D}$, then we say $A$ is a universal
AERM. A useful notion for our online setting will be that of an always AERM, which is satisfied by common
online learners such as FTRL:

Definition 3 (Shalev-Shwartz et al., 2010) An algorithm $A$ is an Always Asymptotic Empirical Risk Minimizer
(always AERM) at rate $\epsilon_{\text{erm}}(m)$ if for all $m$ and dataset $S$ of $m$ data points:

$$F_S(A(S)) - \min_{h \in \mathcal{H}} F_S(h) \le \epsilon_{\text{erm}}(m) \tag{6}$$

Learnability in the batch setting is concerned with the existence of algorithms that are universally
consistent:

Definition 4 (Shalev-Shwartz et al., 2010) An algorithm $A$ is said to be universally consistent at rate
$\epsilon_{\text{cons}}(m)$ if for all $m$ and distribution $\mathcal{D}$:

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[F(A(S)) - \min_{h \in \mathcal{H}} F(h)\right] \le \epsilon_{\text{cons}}(m) \tag{7}$$

If such an algorithm $A$ exists, we say the problem is learnable. A well known result in the supervised
classification and regression setting (i.e. the loss $f(h, (x, y))$ is $I(h(x) \ne y)$ or $(h(x) - y)^2$) is that
learnability is equivalent to uniform convergence of the empirical risk to the population risk over the class
$\mathcal{H}$ (Blumer et al., 1989; Alon et al., 1997). This implies the problem is learnable using an ERM.



Shalev-Shwartz et al. (2010, 2009) recently showed that the situation is much more complex in the General
Learning Setting considered here. For instance, there are convex optimization problems where uniform
convergence does not hold that are learnable via an AERM, but not learnable via any ERM (Shalev-Shwartz et al.,
2010, 2009). In the General Learning Setting, stability turns out to be a more suitable notion to characterize
learnability than uniform convergence.

Most stability notions studied in the literature fall into two categories: leave-one-out (LOO) stability
and replace-one (RO) stability. The former measures sensitivity of the algorithm to deletion of a single data 
point from the dataset, while the latter measures sensitivity of the algorithm to replacing one data point in the 
dataset by another. In general these two notions are incomparable and lead to significantly different results 
as we shall see below. We now review the most commonly used stability notions and some of the important 
results from the literature. 



2.1 Leave-One-Out Stability 

Most notions of LOO stability are measured in terms of change in the loss on a leave-one-out sample when 
looking at the output hypothesis trained with and without that sample in the dataset. The four commonly used 
notions of LOO stability (from strongest to weakest) are defined below. We use $z_i$ to denote the $i$-th data point
in the dataset $S$ and $S^{\setminus i}$ to denote the dataset $S$ with $z_i$ removed.

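Definition 5 (Shalev-Shwartz et al., 2009) An algorithm $A$ is uniform-LOO Stable at rate $\epsilon_{\text{loo-stable}}(m)$ if for
all $m$, dataset $S$ of size $m$ and index $i \in \{1, 2, \dots, m\}$:

$$|f(A(S^{\setminus i}), z_i) - f(A(S), z_i)| \le \epsilon_{\text{loo-stable}}(m) \tag{8}$$

Definition 6 (Shalev-Shwartz et al., 2009) An algorithm $A$ is all-i-LOO Stable under distribution $\mathcal{D}$ at rate
$\epsilon_{\text{loo-stable}}(m)$ if for all $m$ and index $i \in \{1, 2, \dots, m\}$:

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[|f(A(S^{\setminus i}), z_i) - f(A(S), z_i)|\right] \le \epsilon_{\text{loo-stable}}(m) \tag{9}$$

Definition 7 (Shalev-Shwartz et al., 2009) An algorithm $A$ is LOO Stable under distribution $\mathcal{D}$ at rate
$\epsilon_{\text{loo-stable}}(m)$ if for all $m$:

$$\frac{1}{m} \sum_{i=1}^m \mathbb{E}_{S \sim \mathcal{D}^m}\left[|f(A(S^{\setminus i}), z_i) - f(A(S), z_i)|\right] \le \epsilon_{\text{loo-stable}}(m) \tag{10}$$

Definition 8 (Shalev-Shwartz et al., 2009) An algorithm $A$ is on-average-LOO Stable under distribution $\mathcal{D}$
at rate $\epsilon_{\text{loo-stable}}(m)$ if for all $m$:

$$\left|\frac{1}{m} \sum_{i=1}^m \mathbb{E}_{S \sim \mathcal{D}^m}\left[f(A(S^{\setminus i}), z_i) - f(A(S), z_i)\right]\right| \le \epsilon_{\text{loo-stable}}(m) \tag{11}$$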
Whenever one of these properties holds for all distributions $\mathcal{D}$ we shall say it holds universally (e.g. universal
on-average-LOO stable). Each of these properties implies all the ones below it at the same rate (e.g. a
uniform-LOO stable algorithm at rate $\epsilon_{\text{loo-stable}}(m)$ is also all-i-LOO stable, LOO stable and on-average-LOO
stable at rate $\epsilon_{\text{loo-stable}}(m)$) (Shalev-Shwartz et al., 2009). However the implications do not hold in the
opposite direction, and there are counter-examples for each implication in the opposite direction
(Shalev-Shwartz et al., 2009). The only exception is that for symmetric algorithms $A$ (meaning the order of the data
in the dataset does not matter), all-i-LOO stable and LOO stable are equivalent (Shalev-Shwartz et al., 2009). Some of
these stability notions have also been studied by different authors under different names (Bousquet and Elisseeff,
2002; Kutin and Niyogi, 2002; Rakhlin et al., 2005; Mukherjee et al., 2006), sometimes with slight variations
on the definitions.
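To make the leave-one-out quantities concrete, here is a small sketch that evaluates the gap inside Definition 5 on one fixed dataset. The choice of algorithm (the sample mean, which is the ERM for the squared loss over the reals), the dataset and all function names are illustrative assumptions. The distribution-dependent notions additionally take an expectation over S, which this sketch does not attempt.

```python
def mean_erm(S):
    """ERM for the squared loss over the reals: the sample mean (0.0 on an empty dataset)."""
    return sum(S) / len(S) if S else 0.0


def squared_loss(h, z):
    return (h - z) ** 2


def loo_gaps(algo, loss, S):
    """|f(A(S without z_i), z_i) - f(A(S), z_i)| for every index i of the dataset S."""
    h_full = algo(S)
    gaps = []
    for i, z_i in enumerate(S):
        h_loo = algo(S[:i] + S[i + 1:])     # retrain with the i-th point held out
        gaps.append(abs(loss(h_loo, z_i) - loss(h_full, z_i)))
    return gaps


S = [0.0, 0.0, 0.0, 1.0]
gaps = loo_gaps(mean_erm, squared_loss, S)
worst_case_gap = max(gaps)          # what uniform-LOO stability bounds, for this one dataset
average_gap = sum(gaps) / len(S)    # averaging over i, as in the weaker notions
```

On this dataset the worst gap comes from holding out the outlier, and the average over indices is much smaller, which mirrors why the averaged notions are weaker than the uniform one.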

Another even stronger notion of LOO stability, simply called uniform stability, was studied by Bousquet and Elisseeff
(2002). It is similar to uniform-LOO stability except that the absolute difference in loss needs to be smaller
than $\epsilon_{\text{loo-stable}}(m)$ at all $z \in \mathcal{Z}$ for any held out $z_i$, instead of just at the held out data point $z_i$. However, it
turns out we do not need a notion stronger than uniform-LOO stability to guarantee online learnability.

Shalev-Shwartz et al. (2009) have shown the following two results for AERM and ERM in the General
Learning Setting:

Theorem 9 (Shalev-Shwartz et al., 2009) A problem is learnable if and only if there exists a universal
on-average-LOO stable AERM.

Theorem 10 (Shalev-Shwartz et al., 2009) A problem is learnable with an ERM if and only if there exists a
universal LOO stable ERM.

A nice consequence of these results is that for batch learning in the General Learning Setting, it is sufficient
to restrict our attention to AERMs that have such stability properties. We will see that the notion of LOO
stability, especially uniform-LOO stability, is very natural for analyzing online algorithms, as the algorithm must
output a sequence of hypotheses as the dataset grows one data point at a time. In the context of batch
learning, RO stability is a more natural notion and leads to stronger results.

2.2 Replace-One Stability 

Most notions of RO stability are measured in terms of change in the loss at another sample point when looking 
at the output hypothesis trained with an initial dataset and that dataset with one data point replaced by another. 
We briefly mention two of the strongest RO stability notions that turn out to be both sufficient and necessary
for batch learnability. Another weaker notion of RO stability has been studied in Shalev-Shwartz et al. (2010).
For the definitions below, we denote by $S^{(i)}$ the dataset $S$ with the $i$-th data point replaced by another data point
$z'_i$.

Definition 11 (Shalev-Shwartz et al., 2010) An algorithm $A$ is strongly-uniform-RO Stable at rate $\epsilon_{\text{ro-stable}}(m)$
if for all $m$, dataset $S$ of size $m$ and data points $z'_i$ and $z'$:

$$|f(A(S^{(i)}), z') - f(A(S), z')| \le \epsilon_{\text{ro-stable}}(m) \tag{12}$$

Definition 12 (Shalev-Shwartz et al., 2010) An algorithm $A$ is uniform-RO Stable at rate $\epsilon_{\text{ro-stable}}(m)$ if for
all $m$, dataset $S$ of size $m$ and data points $\{z'_1, z'_2, \dots, z'_m\}$ and $z'$:

$$\frac{1}{m} \sum_{i=1}^m |f(A(S^{(i)}), z') - f(A(S), z')| \le \epsilon_{\text{ro-stable}}(m) \tag{13}$$

The definition of strongly-uniform-RO Stable is similar to the definition of uniform stability of Bousquet and Elisseeff
(2002), except that we replace a data point instead of deleting one. RO stability makes it possible to show the
following result, which is much stronger than what is obtained with LOO stability:

Theorem 13 (Shalev-Shwartz et al., 2010) A problem is learnable if and only if there exists a uniform-RO
stable AERM.

In addition, if we allow for randomized algorithms, in that the algorithm outputs a distribution $d$ over $\mathcal{H}$ such
that the loss $f(d, z) = \mathbb{E}_{h \sim d}[f(h, z)]$, then an even stronger result can be shown:

Theorem 14 (Shalev-Shwartz et al., 2010) A problem is learnable if and only if there exists a
strongly-uniform-RO stable always AERM (potentially randomized).

Note that if the problem is learnable, the loss $f$ is convex in $h$ for all $z$ and $\mathcal{H}$ is a convex set, then there
must exist a deterministic algorithm that is a strongly-uniform-RO stable always AERM (namely the algorithm
that returns $\mathbb{E}_{h \sim d}[h]$ for the distribution $d$ picked by the randomized algorithm).
3 Sufficient Stability Conditions in the Online Setting

We now move our attention to the problem of online learning, where the data points $z_1, z_2, \dots, z_m$ are
revealed to the algorithm in sequence and potentially chosen adversarially given knowledge of the algorithm
$A$. We consider using a batch algorithm in this online setting in the following way: let $S_i = \{z_1, z_2, \dots, z_i\}$
denote the dataset of the first $i$ data points; at each time $i$, after observing $S_{i-1}$, the batch algorithm $A$ is
used to pick the hypothesis $h_i = A(S_{i-1})$. As mentioned previously, online algorithms like
Follow-the-(Regularized)-Leader can be thought of in this way. This can also be thought of as a batch-to-online reduction,
similar to the approach of Kakade and Kalai (2006), where we reduce online learning to solving a sequence
of batch learning problems. Unlike (Kakade and Kalai, 2006), we consider the general learning setting instead
of the supervised classification setting and do not make the transductive assumption that we have access to
future "unlabeled" data points. Hence our results can be interpreted as a set of general conditions under which
batch algorithms can be used to obtain a no-regret algorithm for online learning.
We now begin by introducing some definitions particular to the online setting:

Definition 15 An algorithm $A$ has no regret at rate $\epsilon_{\text{regret}}(m)$ if for all $m$ and any sequence $z_1, z_2, \dots, z_m$,
potentially chosen adversarially given knowledge of $A$, it holds that:

$$\frac{1}{m} \sum_{i=1}^m f(A(S_{i-1}), z_i) - \min_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m f(h, z_i) \le \epsilon_{\text{regret}}(m) \tag{14}$$
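The batch-to-online reduction and the left-hand side of the no-regret condition can be sketched in a few lines. Everything below is an illustrative assumption rather than the authors' construction: the two-element hypothesis class, the linear loss, the tie-breaking rule and the alternating sequence (a standard hard case for Follow-the-Leader) are chosen only to make the quantities easy to follow by hand.

```python
def run_online(batch_algo, loss, zs):
    """Losses f(A(S_{i-1}), z_i) incurred when h_i is the batch output on the first i-1 points."""
    return [loss(batch_algo(zs[:i]), z) for i, z in enumerate(zs)]


def average_regret(batch_algo, loss, zs, hypotheses):
    """Left-hand side of the no-regret condition, with the min taken over a finite list."""
    incurred = sum(run_online(batch_algo, loss, zs))
    best_in_hindsight = min(sum(loss(h, z) for z in zs) for h in hypotheses)
    return (incurred - best_in_hindsight) / len(zs)


HYPOTHESES = [-1.0, 1.0]     # hypothetical two-element hypothesis class


def linear_loss(h, z):
    return h * z


def follow_the_leader(S):
    """ERM over HYPOTHESES; ties go to the first hypothesis in the list."""
    return min(HYPOTHESES, key=lambda h: sum(linear_loss(h, z) for z in S))


# Adversarial sequence: 0.5 followed by alternating -1, +1.
zs = [0.5] + [-1.0, 1.0] * 5
ftl_regret = average_regret(follow_the_leader, linear_loss, zs, HYPOTHESES)
```

On this sequence the leader flips every round and pays a loss of 1 on each of the last ten rounds, so the average regret is 10/11 and would not vanish if the alternation continued.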

If such an algorithm $A$ exists, we say the problem is online learnable. It is well known that the FTL algorithm
has no regret at rate $O(\frac{\log m}{m})$ for Lipschitz continuous and strongly convex loss $f$ in $h$ at all $z$ (Hazan et al.,
2006; Kakade and Shalev-Shwartz, 2008). Additionally, if $f$ is Lipschitz continuous and convex in $h$ at all
$z$, then the FTRL algorithm has no regret at rate $O(\frac{1}{\sqrt{m}})$ (Kakade and Shalev-Shwartz, 2008).

An important subclass of always AERM algorithms is what we define as a Regularized ERM (RERM): 

Definition 16 An algorithm $A$ is a Regularized ERM if for all $m$ and any dataset $S$ of $m$ data points:

$$r_0(A(S)) + \sum_{i=1}^m [f(A(S), z_i) + r_i(A(S))] = \min_{h \in \mathcal{H}} \; r_0(h) + \sum_{i=1}^m [f(h, z_i) + r_i(h)] \tag{15}$$

where $\{r_i\}_{i=0}^{\infty}$ is a sequence of regularizer functionals ($r_i : \mathcal{H} \to \mathbb{R}$), which measure the complexity of a
hypothesis $h$, and that satisfy $\sup_{h, h' \in \mathcal{H}} |r_i(h) - r_i(h')| \le \rho_i$ for all $i$, where $\{\rho_i\}_{i=0}^{\infty}$ is a sequence that is
$o(1)$.

It is easy to see that any RERM algorithm is always AERM at rate $\frac{1}{m} \sum_{i=0}^m \rho_i$. Additionally, an ERM is a
special case of a RERM where $r_i = 0$ for all $i$. This subclass is important for online learning as FTRL can
be thought of as using an underlying RERM to pick the sequence of hypotheses. Typically FTRL chooses
$r_i = \lambda_i r$ for some regularizer $r$ and $\lambda_i$ a regularization constant such that $\{\lambda_i\}_{i=0}^{\infty}$ is $o(1)$. Many Mirror
Descent type algorithms such as gradient descent can also be interpreted as some form of RERM (see Section
4 and (McMahan, 2011)) but where $r_i$ may depend on previously seen data points. Additionally, Weighted
Majority/Hedge type algorithms can also be interpreted as Randomized RERM (see Section 5). Our strongest
result for online learnability will be particular to the class of RERM.
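A minimal sketch of an FTRL learner written as a RERM may help fix ideas. The setting is an assumption made for illustration only: H = [-1, 1], linear loss f(h, z) = h z, regularizers r_i(h) = lambda_i h^2 / 2 with the hypothetical schedule lambda_i = 1/sqrt(i + 1), so that rho_i = lambda_i / 2 is o(1). In one dimension the regularized minimizer has a closed form followed by a clip to the interval.

```python
import math


def ftrl_rerm(S):
    """RERM on H = [-1, 1] for f(h, z) = h * z with r_i(h) = lambda_i * h**2 / 2.

    Assumed schedule: lambda_i = 1 / sqrt(i + 1), for i = 0..len(S).
    """
    total_weight = sum(1.0 / math.sqrt(i + 1) for i in range(len(S) + 1))
    unconstrained = -sum(S) / total_weight       # minimizes h * sum(S) + total_weight * h**2 / 2
    return max(-1.0, min(1.0, unconstrained))    # constrained minimizer of a 1-D convex quadratic


def average_regret(algo, zs):
    """Average regret of h_i = algo(S_{i-1}) for the linear loss; the best h is an endpoint."""
    incurred = sum(algo(zs[:i]) * z for i, z in enumerate(zs))
    best = min(h * sum(zs) for h in (-1.0, 1.0))
    return (incurred - best) / len(zs)


zs = [0.5] + [-1.0, 1.0] * 50    # illustrative alternating adversarial sequence
ftrl_regret = average_regret(ftrl_rerm, zs)
```

Because the accumulated regularization weight grows like the square root of the number of rounds, the hypothesis moves less and less between rounds, and the average regret on this alternating sequence stays small.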

A notion of stability related to uniform-LOO stability (but slightly weaker) that will be sufficient for our 
online setting is what we define as online stability: 

Definition 17 An algorithm $A$ is Online Stable at rate $\epsilon_{\text{on-stable}}(m)$ if for all $m$ and dataset $S$ of size $m$:

$$|f(A(S^{\setminus m}), z_m) - f(A(S), z_m)| \le \epsilon_{\text{on-stable}}(m) \tag{16}$$

The difference between online stability and uniform-LOO stability is that it is only required to have small
change in loss on the last data point when it is held out, rather than any data point in the dataset $S$. For
symmetric algorithms (e.g. FTL/FTRL algorithms), online stability is equivalent to uniform-LOO stability;
however, it is weaker than uniform-LOO stability for asymmetric algorithms, like the gradient-based methods
analyzed in Section 4. It is also obvious that a uniform-LOO stable algorithm must also be online stable at
rate $\epsilon_{\text{on-stable}}(m) \le \epsilon_{\text{loo-stable}}(m)$.
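The online stability gap only involves the most recent data point, so it can be tracked along a single run. The sketch below does this for one sequence; the algorithm (sample mean, i.e. Follow-the-Leader for the squared loss), the default hypothesis on an empty dataset and the data are all illustrative assumptions.

```python
def online_stability_gaps(batch_algo, loss, zs):
    """|f(A(S_{i-1}), z_i) - f(A(S_i), z_i)| for i = 1..m along one sequence."""
    gaps = []
    for i in range(1, len(zs) + 1):
        z_i = zs[i - 1]
        before = loss(batch_algo(zs[:i - 1]), z_i)   # hypothesis trained without z_i
        after = loss(batch_algo(zs[:i]), z_i)        # hypothesis trained with z_i
        gaps.append(abs(before - after))
    return gaps


def mean_erm(S):
    """Follow-the-Leader for the squared loss: the sample mean (0.0 before any data)."""
    return sum(S) / len(S) if S else 0.0


def squared_loss(h, z):
    return (h - z) ** 2


gaps = online_stability_gaps(mean_erm, squared_loss, [0.0, 1.0, 1.0, 1.0])
```

The gaps shrink as the dataset grows, which is the behaviour an online stability rate is meant to capture; a single run only gives a lower bound on the worst case over all sequences.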

We now present our main results for the class of RERM and always AERM: 

Theorem 18 If there exists an online stable RERM, then the problem is online learnable. In particular, it has
no regret at rate:

$$\epsilon_{\text{regret}}(m) \le \frac{1}{m} \sum_{i=1}^m \epsilon_{\text{on-stable}}(i) + \frac{2}{m} \sum_{i=0}^{m-1} \rho_i + \frac{\rho_m}{m} \tag{17}$$
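To see how the bound of Theorem 18 behaves, it can be evaluated for assumed rates. The helper below simply adds up the three terms; the two rate profiles are hypothetical and only mimic the orders of magnitude discussed in the text (a 1/i stability rate with no regularizer, and rates decaying like 1/sqrt(i + 1)).

```python
def theorem18_bound(eps_on_stable, rho, m):
    """(1/m) sum_{i=1}^m eps(i) + (2/m) sum_{i=0}^{m-1} rho(i) + rho(m)/m."""
    stability_term = sum(eps_on_stable(i) for i in range(1, m + 1)) / m
    regularizer_term = 2.0 * sum(rho(i) for i in range(m)) / m
    return stability_term + regularizer_term + rho(m) / m


# Hypothetical rates: an ERM (rho = 0) that is online stable at rate 1/i.
ftl_like = [theorem18_bound(lambda i: 1.0 / i, lambda i: 0.0, m) for m in (10, 100, 1000)]

# Hypothetical rates: both the stability rate and rho_i decay like 1/sqrt(i + 1).
ftrl_like = [theorem18_bound(lambda i: (i + 1) ** -0.5, lambda i: (i + 1) ** -0.5, m)
             for m in (10, 100, 1000)]
```

In the first profile the bound is the harmonic sum divided by m, so it decays like log(m)/m; in the second it decays like 1/sqrt(m).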
This theorem implies that both FTL and FTRL algorithms are guaranteed to achieve no regret on any problem
where they are online stable (or uniform-LOO stable, as these algorithms are symmetric). In fact it is easy to
show that in the case where $f$ is strongly convex in $h$, FTL is uniform-LOO stable at rate $O(\frac{1}{m})$ (see Lemma
26). Additionally, when $f$ is convex in $h$, it is easy to show that FTRL is uniform-LOO stable at rate $O(\frac{1}{\sqrt{m}})$ when
choosing a strongly convex regularizer $r$ such that $r_m = \lambda_m r$ and $\lambda_m$ is $\Theta(1/\sqrt{m})$ (see Lemmas 27 and
28), while FTL is not uniform-LOO stable. It is well known that FTL is not a no-regret algorithm for general
convex problems. Hence using only uniform-LOO stability we can prove currently known results about FTL
and FTRL.

An interesting application of this result is in the context of apprenticeship/imitation learning, where it 
has been shown that such non-i.i.d. supervised learning problems can be reduced to online learning over
mini-batches of data (Ross et al., 2011). In this reduction, a classification algorithm is used to pick the next
"leader" (best classifier in hindsight) at each iteration of training, that is in turn used to collect more data (to 
add to the training dataset for the next iteration) from the expert we want to mimic. This result implies that 
online stability (or uniform-LOO stability) of the base classification algorithm in this reduction is sufficient 
to guarantee no regret, and hence that the reduction provides a good bound on performance. 

Unfortunately our current result for the class of always AERM is weaker: 

Theorem 19 If there exists an always AERM such that either (1) or (2) holds:

1. It is always AERM at rate $o(\frac{1}{m})$ and online stable.

2. It is symmetric, uniform LOO stable at rate $o(\frac{1}{m})$ and uniform RO stable at rate $o(\frac{1}{m})$.

then the problem is online learnable. In particular, for each case it has no regret at rate:

1. $\epsilon_{\text{regret}}(m) \le \frac{1}{m} \sum_{i=1}^m \epsilon_{\text{on-stable}}(i) + \frac{1}{m} \sum_{i=1}^m i\, \epsilon_{\text{erm}}(i)$

2. $\epsilon_{\text{regret}}(m) \le \frac{1}{m} \sum_{i=1}^m \epsilon_{\text{loo-stable}}(i) + \epsilon_{\text{erm}}(m) + \frac{1}{m} \sum_{i=1}^{m-1} i\,[\epsilon_{\text{loo-stable}}(i) + \epsilon_{\text{ro-stable}}(i)]$

We believe the required rates of $o(\frac{1}{m})$ might simply be an artefact of our particular proof technique and that
in general it might be true that any always AERM achieves no regret as long as it is online stable. We were not
able to find a counter-example where this is not the case.

3.1 Detailed Analysis 

We will use the notation $R_m(A)$ to denote the regret (as in Equation 1) of the sequence of hypotheses
predicted by algorithm $A$. We begin by showing the following lemma that will allow us to relate the regret of
any algorithm to its online stability and AERM properties.

Lemma 20 For any algorithm $A$:

$$R_m(A) = \sum_{i=1}^m [f(A(S_{i-1}), z_i) - f(A(S_i), z_i)] + \sum_{i=1}^m f(A(S_m), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^m f(h, z_i) + \sum_{i=1}^{m-1} \sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}), z_j)] \tag{18}$$

Proof:

$$R_m(A) = \sum_{i=1}^m f(A(S_{i-1}), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^m f(h, z_i)$$

$$= \sum_{i=1}^m [f(A(S_{i-1}), z_i) - f(A(S_m), z_i)] + \sum_{i=1}^m f(A(S_m), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^m f(h, z_i)$$

For the term $\sum_{i=1}^m f(A(S_m), z_i)$ inside the first sum, we can rewrite it using the following manipulation, applied recursively:

$$\sum_{i=1}^m f(A(S_m), z_i) = \sum_{i=1}^{m-1} f(A(S_m), z_i) + f(A(S_m), z_m)$$

$$= \sum_{i=1}^{m-1} f(A(S_{m-1}), z_i) + \sum_{j=1}^{m-1} [f(A(S_m), z_j) - f(A(S_{m-1}), z_j)] + f(A(S_m), z_m)$$

$$= \sum_{i=1}^{m} f(A(S_i), z_i) + \sum_{i=1}^{m-1} \sum_{j=1}^{i} [f(A(S_{i+1}), z_j) - f(A(S_i), z_j)]$$

Substituting this expression into the first sum proves the lemma. ■
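Since Lemma 20 is an exact identity that holds for any algorithm, it can be sanity-checked numerically. The sketch below computes the regret and the three terms on the right-hand side of Equation 18 separately. The "algorithm" (predict the last point seen), the loss, the data and the finite grid standing in for the hypothesis class are arbitrary illustrative choices.

```python
def lemma20_terms(algo, loss, zs, hypotheses):
    """Return (regret, stability_term, erm_term, double_sum) for h_i = algo(S_{i-1})."""
    m = len(zs)
    S = [zs[:i] for i in range(m + 1)]             # S[i] is the first i data points
    h = [algo(S_i) for S_i in S]                   # h[i] = A(S_i)
    best = min(sum(loss(g, z) for z in zs) for g in hypotheses)
    regret = sum(loss(h[i - 1], zs[i - 1]) for i in range(1, m + 1)) - best
    stability_term = sum(loss(h[i - 1], zs[i - 1]) - loss(h[i], zs[i - 1])
                         for i in range(1, m + 1))
    erm_term = sum(loss(h[m], z) for z in zs) - best
    double_sum = sum(loss(h[i], zs[j - 1]) - loss(h[i + 1], zs[j - 1])
                     for i in range(1, m) for j in range(1, i + 1))
    return regret, stability_term, erm_term, double_sum


def last_point(S):
    """A deliberately arbitrary 'algorithm': predict the most recent data point."""
    return S[-1] if S else 0.0


zs = [0.3, -1.2, 0.7, 2.0, -0.4]
grid = [x / 10.0 for x in range(-20, 21)]          # finite stand-in for the hypothesis class
regret, stab, erm, dbl = lemma20_terms(last_point, lambda h, z: abs(h - z), zs, grid)
```

The regret equals the sum of the three terms up to floating-point error, whatever algorithm is plugged in; the later lemmas only differ in how they bound the double sum.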

From this lemma we can immediately see that for any online stable always AERM algorithm A we obtain 
the following: 
Corollary 21 For any online stable always AERM algorithm $A$:

$$R_m(A) \le \sum_{i=1}^m \epsilon_{\text{on-stable}}(i) + m\, \epsilon_{\text{erm}}(m) + \sum_{i=1}^{m-1} \sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}), z_j)] \tag{19}$$

Proof: By online stability we have that for all $i$:

$$f(A(S_{i-1}), z_i) - f(A(S_i), z_i) \le |f(A(S_{i-1}), z_i) - f(A(S_i), z_i)| = |f(A(S_i^{\setminus i}), z_i) - f(A(S_i), z_i)| \le \epsilon_{\text{on-stable}}(i)$$

and since $A$ is always AERM it follows by definition that:

$$\sum_{i=1}^m f(A(S_m), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^m f(h, z_i) \le m\, \epsilon_{\text{erm}}(m)$$

■

We will now seek to upper bound the extra double summation part. For an ERM it can easily be seen that:

Lemma 22 For any ERM algorithm $A$:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}), z_j)] \le 0 \tag{20}$$

Proof: Follows immediately since $\sum_{j=1}^{i} f(A(S_i), z_j)$ is optimal, hence for any other hypothesis $h$, in particular
$A(S_{i+1})$, $\sum_{j=1}^{i} f(A(S_i), z_j) \le \sum_{j=1}^{i} f(A(S_{i+1}), z_j)$. ■

Since an ERM has $\epsilon_{\text{erm}}(m) = 0$ for all $m$, it can be seen directly that an ERM has no regret if it is
online stable, as $\frac{R_m(A)}{m} \le \frac{1}{m} \sum_{i=1}^m \epsilon_{\text{on-stable}}(i)$.

For general RERM this double summation can be bounded by: 

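Lemma 23 For any RERM algorithm $A$:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}), z_j)] \le \sum_{i=0}^{m-1} \rho_i \tag{21}$$

Proof:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}), z_j)]$$

$$= \sum_{i=1}^{m-1} \Big[ \sum_{j=1}^{i} f(A(S_i), z_j) + \sum_{j=0}^{i} [r_j(A(S_i)) - r_j(A(S_i))] - \sum_{j=1}^{i} f(A(S_{i+1}), z_j) - \sum_{j=0}^{i} [r_j(A(S_{i+1})) - r_j(A(S_{i+1}))] \Big]$$

$$\le \sum_{i=1}^{m-1} \sum_{j=0}^{i} [r_j(A(S_{i+1})) - r_j(A(S_i))]$$

$$= \sum_{i=0}^{m-1} [r_i(A(S_m)) - r_i(A(S_{\max(i,1)}))]$$

$$\le \sum_{i=0}^{m-1} \rho_i$$

where the first inequality uses the fact that $A(S_i)$ minimizes $\sum_{j=1}^{i} f(h, z_j) + \sum_{j=0}^{i} r_j(h)$, and the following equality telescopes the sum over $i$ for each fixed $j$. ■

Combining this result with Corollary 21 proves our main result in Theorem 18, using the fact that a RERM
is always AERM at rate $\frac{1}{m} \sum_{i=0}^m \rho_i$.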
It is however harder to bound this double summation by a term that becomes negligible (when looking at 

the average regret) for general always AERM. We can show the following: 

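Lemma 24 For any always AERM algorithm $A$:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}), z_j)] \le \sum_{i=1}^{m-1} i\, \epsilon_{\text{erm}}(i) \tag{22}$$

Proof:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}), z_j)] \le \sum_{i=1}^{m-1} \Big[ \sum_{j=1}^{i} f(A(S_i), z_j) - \min_{h \in \mathcal{H}} \sum_{j=1}^{i} f(h, z_j) \Big] \le \sum_{i=1}^{m-1} i\, \epsilon_{\text{erm}}(i)$$

■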

This proves case (1) of Theorem 19 when combined with Corollary 21. If we have a symmetric always
AERM that is uniform LOO stable and uniform RO stable then we can also show:






Lemma 25 For any symmetric always AERM algorithm $A$ that is both uniform LOO stable and uniform RO
stable:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}), z_j)] \le \sum_{i=1}^{m-1} i\,[\epsilon_{\text{loo-stable}}(i) + \epsilon_{\text{ro-stable}}(i)] \tag{23}$$

Proof:

$$\sum_{i=1}^{m-1} \sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}), z_j)]$$

$$= \sum_{i=1}^{m-1} \sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}^{\setminus j}), z_j) + f(A(S_{i+1}^{\setminus j}), z_j) - f(A(S_{i+1}), z_j)]$$

For symmetric algorithms, the terms $\sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}^{\setminus j}), z_j)]$ are related to RO stability, as
$S_{i+1}^{\setminus j}$ corresponds to $S_i$ where we replace $z_j$ by $z_{i+1}$. Hence for symmetric algorithms, by definition
of uniform RO stability we have: $\sum_{j=1}^{i} [f(A(S_i), z_j) - f(A(S_{i+1}^{\setminus j}), z_j)] \le i\, \epsilon_{\text{ro-stable}}(i)$. Furthermore, by
definition of uniform LOO stability, the terms $\sum_{j=1}^{i} [f(A(S_{i+1}^{\setminus j}), z_j) - f(A(S_{i+1}), z_j)] \le i\, \epsilon_{\text{loo-stable}}(i)$. This
proves the lemma. ■

This lemma proves case (2) of Theorem 19 when combined with Corollary 21.

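Now we show that strong convexity, either in $f$ or in the regularizers $r_i$ when $f$ is only convex, implies
uniform-LOO stability:

Lemma 26 For any ERM $A$: If $\mathcal{H}$ is a convex set, and for some norm $\|\cdot\|$ on $\mathcal{H}$ we have that at all $z \in \mathcal{Z}$,
$f(\cdot, z)$ is $L$-Lipschitz continuous in $\|\cdot\|$ and $\nu$-strongly convex in $\|\cdot\|$, then $A$ is uniform-LOO stable at rate
$\epsilon_{\text{loo-stable}}(m) \le \frac{2L^2}{\nu m}$.

Proof: By Lipschitz continuity we have $|f(A(S^{\setminus i}), z_i) - f(A(S), z_i)| \le L \|A(S^{\setminus i}) - A(S)\|$. We can use
strong convexity to bound $\|A(S^{\setminus i}) - A(S)\|$: For all $\alpha \in (0, 1)$ we have:

$$\sum_{j=1}^m \alpha f(A(S^{\setminus i}), z_j) + (1 - \alpha) f(A(S), z_j)$$

$$\ge \sum_{j=1}^m f(\alpha A(S^{\setminus i}) + (1 - \alpha) A(S), z_j) + \frac{\alpha (1 - \alpha) m \nu}{2} \|A(S^{\setminus i}) - A(S)\|^2$$

$$\ge \sum_{j=1}^m f(A(S), z_j) + \frac{\alpha (1 - \alpha) m \nu}{2} \|A(S^{\setminus i}) - A(S)\|^2$$

where the last inequality follows from the fact that $A(S)$ is the ERM on $S$. So we obtain for all $\alpha \in (0, 1)$:

$$\|A(S^{\setminus i}) - A(S)\|^2 \le \frac{2}{(1 - \alpha) m \nu} \sum_{j=1}^m [f(A(S^{\setminus i}), z_j) - f(A(S), z_j)]$$

Since $A(S^{\setminus i})$ is the ERM on $S^{\setminus i}$, then $\sum_{j=1, j \ne i}^m f(A(S), z_j) \ge \sum_{j=1, j \ne i}^m f(A(S^{\setminus i}), z_j)$, so:

$$\|A(S^{\setminus i}) - A(S)\|^2 \le \frac{2}{(1 - \alpha) m \nu} [f(A(S^{\setminus i}), z_i) - f(A(S), z_i)] \le \frac{2L}{(1 - \alpha) m \nu} \|A(S^{\setminus i}) - A(S)\|$$

Hence we conclude $\|A(S^{\setminus i}) - A(S)\| \le \frac{2L}{m \nu (1 - \alpha)}$. Since this holds for all $\alpha \in (0, 1)$, we conclude
$\|A(S^{\setminus i}) - A(S)\| \le \frac{2L}{m \nu}$. This proves the lemma. ■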
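The rate in Lemma 26 can be compared against observed leave-one-out gaps on a simple instance. The sketch below assumes H = Z = [0, 1] and the squared loss f(h, z) = (h - z)^2, for which the ERM is the sample mean, the Lipschitz constant is L = 2 and the strong convexity parameter is nu = 2 under the convention used in the proof. The datasets are hand-picked for illustration; checking a few datasets does not prove the bound, it only shows the bound is not violated on them.

```python
def mean_erm(S):
    """ERM for f(h, z) = (h - z)**2 on H = [0, 1] when all data lie in [0, 1]."""
    return sum(S) / len(S) if S else 0.0


def worst_loo_gap(S):
    """max_i |f(A(S without z_i), z_i) - f(A(S), z_i)| for the squared loss."""
    h_full = mean_erm(S)
    return max(abs((mean_erm(S[:i] + S[i + 1:]) - z) ** 2 - (h_full - z) ** 2)
               for i, z in enumerate(S))


L, NU = 2.0, 2.0   # on [0, 1]: |d/dh (h - z)^2| <= 2, and (h - z)^2 is 2-strongly convex


def lemma26_rate(m):
    """The rate 2 L^2 / (nu m) stated in Lemma 26."""
    return 2.0 * L ** 2 / (NU * m)


# Hand-picked datasets in [0, 1], including an "outlier" in last position.
datasets = [[0.0] * (m - 1) + [1.0] for m in range(2, 30)] + [[0.2, 0.9, 0.5, 0.1, 1.0, 0.0]]
checks = [(worst_loo_gap(S), lemma26_rate(len(S))) for S in datasets]
```

On these datasets the observed gap behaves like 2/m, comfortably below the 4/m given by the lemma for these constants.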

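Lemma 27 For any RERM $A$: If $\mathcal{H}$ is a convex set, and for some norm $\|\cdot\|$ on $\mathcal{H}$ we have that at all $z \in \mathcal{Z}$,
$f(\cdot, z)$ is convex and $L$-Lipschitz continuous in $\|\cdot\|$, and for all $i$, $r_i$ is $L_{R_i}$-Lipschitz continuous in $\|\cdot\|$ and
$\nu_i$-strongly convex in $\|\cdot\|$, then $A$ is uniform-LOO stable at rate $\epsilon_{\text{loo-stable}}(m) \le \frac{2L(L + L_{R_m})}{\sum_{i=0}^m \nu_i}$.

Proof: By Lipschitz continuity we have $|f(A(S^{\setminus i}), z_i) - f(A(S), z_i)| \le L \|A(S^{\setminus i}) - A(S)\|$. We can use
strong convexity of the regularizers to bound $\|A(S^{\setminus i}) - A(S)\|$: For all $\alpha \in (0, 1)$ we have:

$$\sum_{j=1}^m [\alpha f(A(S^{\setminus i}), z_j) + (1 - \alpha) f(A(S), z_j)] + \sum_{j=0}^m [\alpha r_j(A(S^{\setminus i})) + (1 - \alpha) r_j(A(S))]$$

$$\ge \sum_{j=1}^m f(\alpha A(S^{\setminus i}) + (1 - \alpha) A(S), z_j) + \sum_{j=0}^m r_j(\alpha A(S^{\setminus i}) + (1 - \alpha) A(S)) + \frac{\alpha (1 - \alpha)}{2} \sum_{j=0}^m \nu_j \|A(S^{\setminus i}) - A(S)\|^2$$

$$\ge \sum_{j=1}^m f(A(S), z_j) + \sum_{j=0}^m r_j(A(S)) + \frac{\alpha (1 - \alpha)}{2} \sum_{j=0}^m \nu_j \|A(S^{\setminus i}) - A(S)\|^2$$

where the last inequality follows from the fact that $A(S)$ minimizes $\sum_{j=1}^m f(h, z_j) + \sum_{j=0}^m r_j(h)$. So
we obtain for all $\alpha \in (0, 1)$:

$$\|A(S^{\setminus i}) - A(S)\|^2 \le \frac{2}{(1 - \alpha) \sum_{j=0}^m \nu_j} \Big[ \sum_{j=1}^m [f(A(S^{\setminus i}), z_j) - f(A(S), z_j)] + \sum_{j=0}^m [r_j(A(S^{\setminus i})) - r_j(A(S))] \Big]$$

Since $A(S^{\setminus i})$ minimizes $\sum_{j=1, j \ne i}^m f(h, z_j) + \sum_{j=0}^{m-1} r_j(h)$, then
$\sum_{j=1, j \ne i}^m f(A(S), z_j) + \sum_{j=0}^{m-1} r_j(A(S)) \ge \sum_{j=1, j \ne i}^m f(A(S^{\setminus i}), z_j) + \sum_{j=0}^{m-1} r_j(A(S^{\setminus i}))$, so:

$$\|A(S^{\setminus i}) - A(S)\|^2 \le \frac{2}{(1 - \alpha) \sum_{j=0}^m \nu_j} [f(A(S^{\setminus i}), z_i) - f(A(S), z_i) + r_m(A(S^{\setminus i})) - r_m(A(S))] \le \frac{2 [L + L_{R_m}]}{(1 - \alpha) \sum_{j=0}^m \nu_j} \|A(S^{\setminus i}) - A(S)\|$$

Hence we conclude $\|A(S^{\setminus i}) - A(S)\| \le \frac{2 [L + L_{R_m}]}{(1 - \alpha) \sum_{j=0}^m \nu_j}$. Since this holds for all $\alpha \in (0, 1)$, we
conclude $\|A(S^{\setminus i}) - A(S)\| \le \frac{2 [L + L_{R_m}]}{\sum_{j=0}^m \nu_j}$. This proves the lemma. ■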

We also prove an alternate result for the case where the regularizers j-j are strongly convex but not neces- 
sarily Lipschitz continuous: 

Lemma 28 For any RERM A: IfH is a convex set, and for some norm \ \ ■ \ \ on H we have that at all z G Z, 
/(• , z) is convex and L-Lipschitz continuous in 1 1 • 1 1, and for all i > 0, is Vi-strongly convex in \\ ■ \ \ and 

2£ 2 _i_ T, 0£ 



su P/i.ft'G« \ r i{h) — 'fi(h')\ < Pi, then A is uniform-LOO stable at rate ei 00 ^ la hie{m) < — - + L./ ^J! m . ■ 

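Proof: Following a similar proof to the previous proof and using the fact that $r_m(A(S^{\setminus i})) - r_m(A(S)) \leq \rho_m$, we obtain that: $\|A(S^{\setminus i}) - A(S)\|^2 \leq \frac{2L}{\sum_{j=0}^m \lambda_j} \|A(S^{\setminus i}) - A(S)\| + \frac{2\rho_m}{\sum_{j=0}^m \lambda_j}$. This is a quadratic inequality of the form $Ax^2 + Bx + C \leq 0$ in $x = \|A(S^{\setminus i}) - A(S)\|$. Since here $A = 1 > 0$, this implies x is less than or equal to the largest root of $Ax^2 + Bx + C$. We know that the roots are $x = \frac{-B \pm \sqrt{B^2 - 4AC}}{2A}$. Here $A = 1$, $B = -\frac{2L}{\sum_{j=0}^m \lambda_j}$ and $C = -\frac{2\rho_m}{\sum_{j=0}^m \lambda_j}$. So the largest root is: $x = \frac{L}{\sum_{j=0}^m \lambda_j} \Big[1 + \sqrt{1 + \frac{2\rho_m \sum_{j=0}^m \lambda_j}{L^2}}\Big]$. We conclude

$$\|A(S^{\setminus i}) - A(S)\| \leq \frac{L}{\sum_{j=0}^m \lambda_j} \Big[1 + \sqrt{1 + \frac{2\rho_m \sum_{j=0}^m \lambda_j}{L^2}}\Big].$$

Since $\sqrt{1 + \frac{2\rho_m \sum_{j=0}^m \lambda_j}{L^2}} \leq 1 + \frac{\sqrt{2\rho_m \sum_{j=0}^m \lambda_j}}{L}$, we obtain $\|A(S^{\setminus i}) - A(S)\| \leq \frac{2L}{\sum_{j=0}^m \lambda_j} + \sqrt{\frac{2\rho_m}{\sum_{j=0}^m \lambda_j}}$. Combining with the fact that $|f(A(S^{\setminus i}), z_i) - f(A(S), z_i)| \leq L \|A(S^{\setminus i}) - A(S)\|$ proves the lemma. ■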
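The quadratic-root step in the proof above can be checked numerically. The following is a minimal sketch, assuming the inequality has the form $x^2 \leq \frac{2L}{\Lambda} x + \frac{2\rho_m}{\Lambda}$ with $\Lambda = \sum_{j=0}^m \lambda_j$; the function names and the sample values are illustrative and not part of the paper.

```python
import math

def largest_root(L, rho_m, lam_sum):
    """Largest root of x^2 + b*x + c with b = -2L/lam_sum, c = -2*rho_m/lam_sum."""
    b = -2.0 * L / lam_sum
    c = -2.0 * rho_m / lam_sum
    return (-b + math.sqrt(b * b - 4.0 * c)) / 2.0

def simplified_distance_bound(L, rho_m, lam_sum):
    """Looser closed form obtained from sqrt(1 + a) <= 1 + sqrt(a)."""
    return 2.0 * L / lam_sum + math.sqrt(2.0 * rho_m / lam_sum)

def loo_stability_rate(L, rho_m, lam_sum):
    """Uniform-LOO rate: the loss is L-Lipschitz, so multiply the distance bound by L."""
    return L * simplified_distance_bound(L, rho_m, lam_sum)

if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the two bounds.
    for L, rho_m, lam_sum in [(1.0, 0.5, 4.0), (0.5, 0.1, 10.0), (2.0, 0.0, 1.0)]:
        exact = largest_root(L, rho_m, lam_sum)
        loose = simplified_distance_bound(L, rho_m, lam_sum)
        print(f"L={L} rho_m={rho_m} sum_lambda={lam_sum}: root={exact:.4f} <= bound={loose:.4f}")
```

The simplified bound is never smaller than the exact root, and the two coincide when $\rho_m = 0$.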



4 Mirror Descent and Gradient-Based Methods 

So far we have thought of using an underlying batch algorithm to pick the sequence of hypotheses. A popular
class of online methods are gradient-based methods, such as gradient descent and Newton-type methods
(Zinkevich, 2003; Agarwal et al., 2006). Such approaches can all be interpreted as Mirror Descent methods,
and it is known that Mirror Descent algorithms can be thought of as some form of FTRL (McMahan, 2011). The
difference is that they follow the regularized leader on a linear/quadratic approximation to the loss function
(linear/quadratic lower bound in the convex/strongly convex case) at each data point z, and the regularizers
$r_i$ may regularize about the previously chosen $h_i$ (after observing the first $i - 1$ data points) rather than
some fixed hypothesis over the iterations (such as $h_1$). These algorithms are typically not symmetric, as the
approximation points to the loss function (and potentially the regularizers) depend on the order of the data
points in the dataset.

Nevertheless, we can still use our previous analysis to bound the regret for these methods in terms of 
online stability and AERM properties. We will refer to this broad class of methods as Regularized Surrogate 
Loss Minimizer (RSLM): 

Definition 29 An algorithm A is a Regularized Surrogate Loss Minimizer (RSLM) if for all m and any dataset
S of m data points:

$$r_0(A(S)) + \sum_{i=1}^m \big[\ell_i(A(S), z_i) + r_i(A(S))\big] = \min_{h \in \mathcal{H}} r_0(h) + \sum_{i=1}^m \big[\ell_i(h, z_i) + r_i(h)\big] \quad (24)$$

for $\{\ell_i\}_{i=1}^\infty$ the surrogate loss functionals chosen such that $f(A(S_{i-1}), z_i) - f(h, z_i) \leq \ell_i(A(S_{i-1}), z_i) - \ell_i(h, z_i)$ for all $h \in \mathcal{H}$ (i.e. they upper bound the regret), $\{r_i\}_{i=0}^\infty$ the regularizer functionals such that $\sup_{h,h' \in \mathcal{H}} |r_i(h) - r_i(h')| \leq \rho_i$ and $\{\rho_i\}_{i=0}^\infty$ is o(1).

Note that a RERM is a special case of a RSLM where $\ell_i(h, z_i) = f(h, z_i)$.
For the broader class of RSLM, the regret is bounded by: 
Lemma 30 For any RSLM A:

$$R_m(A) \leq \sum_{i=1}^m \big[\ell_i(A(S_{i-1}), z_i) - \ell_i(A(S_i), z_i)\big] + \sum_{i=1}^m \ell_i(A(S_m), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^m \ell_i(h, z_i) + \sum_{i=1}^{m-1} \sum_{j=1}^{i} \big[\ell_j(A(S_i), z_j) - \ell_j(A(S_{i+1}), z_j)\big] \quad (25)$$

Proof: By properties of the functions $\ell_i$ we have that: $R_m(A) \leq \sum_{i=1}^m \ell_i(A(S_{i-1}), z_i) - \min_{h \in \mathcal{H}} \sum_{i=1}^m \ell_i(h, z_i)$.
Using the same manipulations as in Lemma 20 proves the lemma. ■

A RSLM is a RERM in the loss $\{\ell_i\}_{i=1}^\infty$ instead of f. Hence it follows that if such a RSLM is online stable
(in the loss $\{\ell_i\}_{i=1}^\infty$, i.e. $|\ell_m(A(S_{m-1}), z_m) - \ell_m(A(S_m), z_m)| \to 0$ as $m \to \infty$) it must have no regret:

Theorem 31 If there exists a RSLM that is online stable in the surrogate loss $\{\ell_i\}_{i=1}^\infty$, then the problem is
online learnable. In particular, it has no regret at rate:

$$\epsilon_{\text{regret}}(m) \leq \frac{1}{m} \sum_{i=1}^m \epsilon_{\text{on-stable}}(i) + \frac{2}{m} \sum_{i=0}^{m-1} \rho_i + \frac{\rho_m}{m} \quad (26)$$

Proof: Follows from applying Corollary 21 and Lemma 23 (but replacing f by $\{\ell_i\}$) to the previous lemma. ■
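As a concrete illustration of the surrogate-loss view, the sketch below treats one-dimensional online gradient descent as follow-the-regularized-leader on linearized losses. It assumes the squared loss $f(h, z) = (h - z)^2$, linear surrogates $\ell_i(h) = g_i h$ with $g_i$ the gradient at the played point, and a single quadratic regularizer $r_0(h) = h^2 / (2\eta)$ on an unconstrained domain. These choices are illustrative assumptions: in particular, an unbounded quadratic regularizer does not satisfy the boundedness condition of Definition 29 without restricting the domain.

```python
def grad_sq_loss(h, z):
    """Gradient in h of the (assumed) loss f(h, z) = (h - z)^2."""
    return 2.0 * (h - z)

def ogd_iterates(zs, eta):
    """Online gradient descent from h_1 = 0 with a fixed step size eta.

    Returns the played hypotheses, the observed gradients, and the next iterate.
    """
    h, played, grads = 0.0, [], []
    for z in zs:
        played.append(h)
        g = grad_sq_loss(h, z)
        grads.append(g)
        h = h - eta * g
    return played, grads, h

def ftrl_on_linearized_losses(grads, eta):
    """argmin_h  h^2 / (2*eta) + sum_i g_i * h, which has the closed form -eta * sum(g)."""
    return -eta * sum(grads)

def surrogate_upper_bounds_regret(h_played, g, z, h_other, tol=1e-12):
    """Check f(h_i, z) - f(h, z) <= l_i(h_i) - l_i(h) for the linear surrogate l_i(h) = g * h."""
    lhs = (h_played - z) ** 2 - (h_other - z) ** 2
    rhs = g * (h_played - h_other)
    return lhs <= rhs + tol

if __name__ == "__main__":
    zs, eta = [1.0, -0.5, 2.0, 0.25], 0.1
    played, grads, h_next = ogd_iterates(zs, eta)
    print("OGD next iterate :", h_next)
    print("FTRL closed form :", ftrl_on_linearized_losses(grads, eta))
```

Under these assumptions the two procedures pick the same hypothesis at every round, and convexity of the loss guarantees that the linear surrogate upper bounds the instantaneous regret, which is the property Definition 29 asks of $\ell_i$.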



5 Weighted Majority, Hedge and Randomized Algorithms 

We have so far restricted our attention to deterministic algorithms, which upon observing a dataset S return a 
fixed hypothesis $h \in \mathcal{H}$. An important class of methods for online learning are randomized algorithms such
as Weighted Majority, and its generalization Hedge, which instead return a distribution over hypotheses in 
H at each iteration. These randomized algorithms are important in online learning as it is known that some 
problems are not online learnable with deterministic algorithms but are online learnable with randomized al- 
gorithms (assuming the adversary can only be aware of the distribution over hypotheses and not the particular 
hypothesis that will be sampled from this distribution when choosing the data point z). For instance, general 
problems with a finite set of hypotheses fall in this category. 

In this section we show that Weighted Majority, Hedge and similar variants, can be interpreted as Ran- 
domized uniform-LOO stable RERM. We provide an analysis of the stability, AERM and no-regret rates of 
such algorithms based on the previous results derived in this paper. These results will be useful to determine 
the existence of (potentially randomized) uniform-LOO stable RERM for a large class of learning problems. 
Before we introduce this analysis, we first define formally what we mean by a Randomized RERM and how 
notions of stability and no-regret extend to randomized algorithms. 

5.1 Randomized Algorithms 

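Definition 32 Let $\Theta$ be a set such that for any $\theta \in \Theta$, $P_\theta$ is a probability distribution over the class of
hypotheses $\mathcal{H}$, and for any $h \in \mathcal{H}$ and $\epsilon > 0$ there exists a $\theta \in \Theta$ such that $E_{h' \sim P_\theta}[f(h', z)] - f(h, z) \leq \epsilon$
for all $z \in Z$. Let $P_{\theta_S} = A(S)$ denote the distribution picked by algorithm A on dataset S. An algorithm A
is a Randomized RERM if for all m and any dataset S:

$$r_0(\theta_S) + \sum_{i=1}^m \big[E_{h \sim P_{\theta_S}}[f(h, z_i)] + r_i(\theta_S)\big] = \min_{\theta \in \Theta} r_0(\theta) + \sum_{i=1}^m \big[E_{h \sim P_\theta}[f(h, z_i)] + r_i(\theta)\big] \quad (27)$$

for $r_i : \Theta \to \mathbb{R}$ the regularizer functionals, which measure the complexity of a chosen $\theta$, that we assume
satisfy $\sup_{\theta, \theta' \in \Theta} |r_i(\theta) - r_i(\theta')| \leq \rho_i$ and $\{\rho_m\}_{m=0}^\infty$ is o(1).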

The set $\Theta$ might represent a set of parameters parametrizing a family of distributions (e.g. a set of mean-variance
tuples such that $P_\theta$ is Gaussian with those parameters), or in other cases be a set of distributions itself
(e.g. when $\mathcal{H}$ is finite, $\Theta$ might be the set of all discrete distributions over $\mathcal{H}$), in which case $P_\theta = \theta$. The
condition that there exists a $\theta \in \Theta$ such that $E_{h' \sim P_\theta}[f(h', z)] - f(h, z) \leq \epsilon$ for all $z \in Z$ is to ensure the
algorithm is an AERM, i.e. that it can pick a $\theta$ that has average expected loss no greater than the best fixed
$h \in \mathcal{H}$ in the limit as $m \to \infty$. A deterministic RERM is a special case of a Randomized RERM where the
set $\Theta = \mathcal{H}$ and $P_\theta$ is just the probability distribution with probability 1 for the chosen hypothesis $\theta$.

When using a randomized algorithm, the algorithm incurs loss on a hypothesis h sampled from the chosen
$P_\theta$, and we assume the adversary may only be aware of $P_\theta$ in advance (not the particular sampled h) when
choosing z. The previous definitions of stability, AERM and no-regret extend to randomized algorithms by
considering the loss $f(A(S), z) = E_{h \sim A(S)}[f(h, z)]$. Thus a no-regret randomized algorithm is an algorithm
such that its expected average regret under the sequence of chosen distributions goes to 0 as m goes to $\infty$. By
our assumption that the instantaneous regret is bounded, this is also equivalent to saying that its average regret
(under the sampled hypotheses) goes to 0 with probability 1 as m goes to $\infty$ (e.g. using a Hoeffding bound).
Additionally, a randomized online stable algorithm implies that the change in expected loss on the last data
point when it is held out goes to 0 as m goes to $\infty$ ($|E_{h \sim A(S^{\setminus m})}[f(h, z_m)] - E_{h \sim A(S)}[f(h, z_m)]| \to 0$).

5.2 Hedge and Weighted Majority 

An important randomized no-regret online learning algorithm when $\mathcal{H}$ is finite is the Hedge algorithm
(Freund and Schapire, 1997). Hedge is a generalization to arbitrary loss of the Weighted Majority algorithm
that was introduced for the classification setting (Littlestone and Warmuth, 1994). Let $\theta_i$ denote the probability
of hypothesis $h_i$, then at any iteration t, Hedge/Weighted Majority plays $\theta_i \propto \exp(-\eta \sum_{j=1}^{t-1} f(h_i, z_j))$
for some positive constant $\eta$. When the number of rounds m is known in advance, $\eta$ is typically chosen as
$O\big(\frac{1}{B}\sqrt{\frac{\log(d)}{m}}\big)$, for d the number of hypotheses and B the maximum instantaneous regret. We will consider here a slight generalization of
Hedge that can be applied for cases where the number of rounds is not known in advance. In this case at
iteration t: $\theta_i \propto \exp(-\eta_t \sum_{j=1}^{t-1} f(h_i, z_j))$ for some sequence of positive constants $\{\eta_t\}_{t=1}^\infty$. We show here
that Hedge (and Weighted Majority) is in fact a Randomized uniform-LOO stable RERM, where $\Theta$ is the set
of all discrete distributions over the finite set of experts, and the regularizer corresponds to a KL divergence
between the chosen distribution and the uniform distribution over experts:

Theorem 33 For a finite set of d experts with instantaneous regret bounded by B, the Hedge (and Weighted
Majority) algorithm corresponds to the following Randomized uniform-LOO stable RERM. Let $\Theta$ be the set
of distributions over the finite set of d experts, and U denote the uniform distribution, then at each iteration
t, Hedge (and Weighted Majority) picks the distribution $\theta^* \in \Theta$ that satisfies:

$$\theta^* = \arg\min_{\theta \in \Theta} \sum_{i=1}^{t-1} E_{h \sim \theta}[f(h, z_i)] + \sum_{i=0}^{t-1} \lambda_i KL(\theta \| U)$$

i.e. it uses $r_t = \lambda_t r$ for r a KL regularizer with respect to the uniform distribution. Choosing the regularization
constants $\lambda_t = B \sqrt{\frac{1}{8 \log(d) \max(1, t)}}$ for all $t \geq 0$ makes Hedge (and Weighted Majority) uniform-LOO
stable at rate $\epsilon_{\text{loo-stable}}(m) \leq B \sqrt{2 \log(d)} \big[\frac{1}{2\sqrt{m} - 1} + \frac{1}{2\sqrt{m+1}}\big]$, always AERM at rate
$\epsilon_{\text{erm}}(m) \leq B \sqrt{\frac{\log(d)}{2m}} \big(1 + \frac{1}{2\sqrt{m}}\big)$ and no-regret at rate
$\epsilon_{\text{regret}}(m) \leq B \sqrt{2 \log(d)} \big[\frac{3}{\sqrt{m}} + \frac{\log(m)}{2m} + \frac{1 + 2\log(2)}{2m}\big]$.

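Proof: Consider the above Randomized RERM algorithm. Then we have $0 \leq \lambda_i KL(\theta \| U) \leq \lambda_i \log(d)$,
for all i and $\theta \in \Theta$. So $\Theta$ and $\{r_i\}_{i=0}^\infty$ are well defined according to our assumptions in the definition of a
Randomized RERM as long as $\{\lambda_i\}_{i=0}^\infty$ is o(1). Let $h_i$ denote the i-th expert and $\theta_i$ denote the probability
assigned to $h_i$ for a chosen $\theta \in \Theta$. At any iteration t + 1, when the algorithm has observed t data points so
far, the randomized RERM algorithm solves an optimization problem of the form:

$$\begin{aligned}
\arg\min_{\theta \in \Theta}\ & \sum_{j=1}^t \sum_{i=1}^d \theta_i f(h_i, z_j) + \sum_{j=0}^t \lambda_j \sum_{i=1}^d \theta_i \log(\theta_i) \\
\text{s.t.}\ & 0 \leq \theta_i \leq 1 \\
& \sum_{i=1}^d \theta_i = 1
\end{aligned}$$

(the additive constant $\log(d)$ in $KL(\theta \| U) = \sum_{i=1}^d \theta_i \log(\theta_i) + \log(d)$ does not affect the minimizer).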

Using the Lagrangian, we can easily see that the optimal solution to this optimization problem is to choose
$\theta_i \propto \exp\big(-\frac{1}{\sum_{j=0}^t \lambda_j} \sum_{j=1}^t f(h_i, z_j)\big)$ for all i. This is the same as Hedge with learning rate $\frac{1}{\sum_{j=0}^t \lambda_j}$ at iteration t + 1. When Hedge is
playing for m rounds and uses a fixed $\eta$, this can be achieved with a fixed regularizer $\lambda_0 = \frac{1}{\eta}$ and $\lambda_t = 0$
for all $t \geq 1$. So this establishes that Hedge is equivalent to the above RERM. Now let's consider the case
where the number of rounds m is not known in advance and we choose $\lambda_t = \frac{c}{\sqrt{\max(1, t)}}$ for all $t \geq 0$ and
some constant c in the above RERM. This choice leads to $2c\sqrt{t} - c \leq \sum_{j=0}^t \lambda_j \leq 2c\sqrt{t} + c$. Note also that
because $\lambda_t \leq \lambda_j$ for all $j \leq t$ we also have that $(t + 1)\lambda_t \leq \sum_{j=0}^t \lambda_j$. It is easy to see why the above RERM
must be uniform-LOO stable. First the expected loss of the randomized algorithm is linear in $\theta$ (and hence
convex) while the KL regularizer is 1-strongly convex in $\theta$ under $\|\cdot\|_1$ and bounded by $\log(d)$ (so $r_m$ is
$\lambda_m$-strongly convex and bounded by $\lambda_m \log(d)$). Additionally, the expected loss is L-Lipschitz continuous
in $\|\cdot\|_1$ on $\theta$, for $L = \sup_{z \in Z} \inf_{v \in \mathbb{R}} \sup_{h \in \mathcal{H}} |f(h, z) - v|$. This is because for any z:

$$\begin{aligned}
\big|E_{h \sim P_\theta}[f(h, z)] - E_{h \sim P_{\theta'}}[f(h, z)]\big| &= \Big|\sum_{i=1}^d (\theta_i - \theta'_i)(f(h_i, z) - v)\Big| \\
&\leq \sum_{i=1}^d |\theta_i - \theta'_i|\, |f(h_i, z) - v| \\
&\leq \sup_{h \in \mathcal{H}} |f(h, z) - v|\, \|\theta - \theta'\|_1
\end{aligned}$$
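for any $v \in \mathbb{R}$. So we conclude that for all $z \in Z$, $|E_{h \sim P_\theta}[f(h, z)] - E_{h \sim P_{\theta'}}[f(h, z)]| \leq L \|\theta - \theta'\|_1$. If
the loss f has instantaneous regret bounded by B, then we can take $L = \frac{B}{2}$. So by our previous result for RERM
with convex loss and strongly convex regularizers, we obtain that the algorithm is uniform-LOO stable
at rate $\epsilon_{\text{loo-stable}}(m) \leq \frac{2L^2}{\sum_{i=0}^m \lambda_i} + L \sqrt{\frac{2\lambda_m \log(d)}{\sum_{i=0}^m \lambda_i}}$. So that the algorithm has no regret at rate
$\epsilon_{\text{regret}}(m) \leq \frac{1}{m} \sum_{i=1}^m \Big[\frac{2L^2}{\sum_{j=0}^i \lambda_j} + L \sqrt{\frac{2\lambda_i \log(d)}{\sum_{j=0}^i \lambda_j}}\Big] + \frac{2}{m} \sum_{i=0}^{m-1} \lambda_i \log(d) + \frac{\lambda_m \log(d)}{m}$. Setting $\lambda_t = \frac{c}{\sqrt{\max(1, t)}}$ leads to:

$$\begin{aligned}
\epsilon_{\text{regret}}(m) &\leq \frac{1}{m} \sum_{i=1}^m \Big[\frac{2L^2}{\sum_{j=0}^i \lambda_j} + L \sqrt{\frac{2\lambda_i \log(d)}{\sum_{j=0}^i \lambda_j}}\Big] + \frac{2\log(d)}{m} \sum_{i=0}^{m-1} \lambda_i + \frac{\log(d) \lambda_m}{m} \\
&\leq \frac{1}{m} \sum_{i=1}^m \Big[\frac{2L^2}{c(2\sqrt{i} - 1)} + L \sqrt{\frac{2\log(d)}{i + 1}}\Big] + \frac{2\log(d)}{m} (2c\sqrt{m} + c) \\
&\leq \frac{2L^2}{cm} \Big(\sqrt{m} + \frac{1}{2}\log(m) + \log(2)\Big) + \frac{2L\sqrt{2\log(d)}}{\sqrt{m}} + \frac{4c\log(d)}{\sqrt{m}} + \frac{2c\log(d)}{m} \\
&= \frac{1}{\sqrt{m}} \Big[\frac{2L^2}{c} + 4c\log(d) + 2L\sqrt{2\log(d)}\Big] + \frac{1}{m} \Big[\frac{L^2}{c}(\log(m) + 2\log(2)) + 2c\log(d)\Big]
\end{aligned}$$

where the second inequality uses $\frac{1}{\sum_{j=0}^i \lambda_j} \leq \frac{1}{c(2\sqrt{i} - 1)}$, $\frac{\lambda_i}{\sum_{j=0}^i \lambda_j} \leq \frac{1}{i + 1}$ and $\sum_{i=0}^m \lambda_i \leq 2c\sqrt{m} + c$; and
the third inequality uses the fact that $\sum_{i=1}^m \frac{1}{2\sqrt{i} - 1} \leq \sqrt{m} + \frac{1}{2}\log(m) + \log(2)$ and $\sum_{i=1}^m \frac{1}{\sqrt{i + 1}} \leq 2\sqrt{m}$,
which follows from using the integrals to upper bound the summations. Setting $c = \frac{L}{\sqrt{2\log(d)}}$ minimizes
the factor multiplying the $\frac{1}{\sqrt{m}}$ term. This leads to $\epsilon_{\text{regret}}(m) \leq L\sqrt{2\log(d)} \big[\frac{6}{\sqrt{m}} + \frac{\log(m)}{m} + \frac{1 + 2\log(2)}{m}\big]$,
$\epsilon_{\text{loo-stable}}(m) \leq L\sqrt{2\log(d)} \big[\frac{2}{2\sqrt{m} - 1} + \frac{1}{\sqrt{m + 1}}\big]$ and $\epsilon_{\text{erm}}(m) \leq L\sqrt{\frac{2\log(d)}{m}} \big(1 + \frac{1}{2\sqrt{m}}\big)$. Plugging in $L = \frac{B}{2}$
proves the statements in the theorem. ■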
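The correspondence between Hedge and the KL-regularized RERM can be sketched in a few lines. The code below is a minimal illustration, assuming losses in [0, 1] (so B = 1), d ≥ 2 experts, the schedule $\lambda_j = B / (2\sqrt{2 \log(d) \max(1, j)})$ and the no-regret rate $B\sqrt{2\log(d)}\,[3/\sqrt{m} + \log(m)/(2m) + (1 + 2\log 2)/(2m)]$; all function names are hypothetical.

```python
import math
import random

def lambda_schedule(t, B, d):
    """lambda_0..lambda_t with lambda_j = B / (2*sqrt(2*log(d)*max(1, j))); requires d >= 2."""
    return [B / (2.0 * math.sqrt(2.0 * math.log(d) * max(1, j))) for j in range(t + 1)]

def hedge_distribution(cum_losses, lambdas):
    """Closed-form minimizer: theta_i proportional to exp(-cum_loss_i / sum(lambdas))."""
    eta = 1.0 / sum(lambdas)
    lo = min(cum_losses)  # shifting by a constant leaves the distribution unchanged
    w = [math.exp(-eta * (c - lo)) for c in cum_losses]
    total = sum(w)
    return [x / total for x in w]

def kl_to_uniform(theta):
    d = len(theta)
    return sum(p * math.log(p * d) for p in theta if p > 0.0)

def rerm_objective(theta, cum_losses, lambdas):
    """Expected cumulative loss plus sum(lambdas) * KL(theta || uniform)."""
    return sum(p * c for p, c in zip(theta, cum_losses)) + sum(lambdas) * kl_to_uniform(theta)

def run_hedge(loss_rows, B=1.0):
    """loss_rows[t][i] is the loss of expert i at round t+1; returns the average expected regret."""
    d = len(loss_rows[0])
    cum = [0.0] * d
    expected_loss = 0.0
    for t, row in enumerate(loss_rows):
        # Before round t+1 the learner has seen t points and uses lambda_0..lambda_t.
        theta = hedge_distribution(cum, lambda_schedule(t, B, d))
        expected_loss += sum(p * l for p, l in zip(theta, row))
        cum = [c + l for c, l in zip(cum, row)]
    return (expected_loss - min(cum)) / len(loss_rows)

def regret_bound(m, B, d):
    """No-regret rate as stated in the lead-in (an assumption of this sketch)."""
    return B * math.sqrt(2.0 * math.log(d)) * (
        3.0 / math.sqrt(m) + math.log(m) / (2.0 * m) + (1.0 + 2.0 * math.log(2.0)) / (2.0 * m))

if __name__ == "__main__":
    rng = random.Random(0)
    rows = [[rng.random() for _ in range(5)] for _ in range(500)]
    print("average regret:", run_hedge(rows), "bound:", regret_bound(500, 1.0, 5))
```

On benign (e.g. random) loss sequences the observed average regret is typically far below the bound, which is a worst-case guarantee.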

This theorem establishes the following: 

Corollary 34 Any learning problem with a finite hypothesis class (and bounded instantaneous regret) is 
online learnable with a (potentially randomized) uniform-LOO stable RERM. 



In Section 6.1.2 we will also demonstrate that when $\mathcal{H}$ is infinite, but can be "finitely approximated" well
enough with respect to the loss f, then the problem is also online learnable via a (potentially randomized)
uniform-LOO stable RERM.



6 Is Uniform LOO Stability Necessary? 

We now restrict our attention to symmetric algorithms where we have shown that uniform-LOO stability is 
sufficient for online learnability. We start by giving instructive examples that illustrate that in fact uniform- 
LOO stability might be necessary to achieve no regret. 



Example 6.1 There exists a problem that is learnable in the batch setting with an ERM that is universal
all-i-LOO stable. However that problem is not online learnable (by any deterministic algorithm) and there
does not exist any (deterministic) algorithm that can be both uniform LOO stable and always AERM. When
allowing randomized algorithms (convexifying the problem), the problem is online learnable via a uniform
LOO stable RERM, but there exist (randomized) universal all-i-LOO stable RERMs that are not uniform-LOO
stable and cannot achieve no regret.

Proof: This example was studied in both (Kutin and Niyogi, 2002; Shalev-Shwartz et al., 2009). Consider
the hypothesis space $\mathcal{H} = \{0, 1\}$, the instance space $Z = \{0, 1\}$ and the loss $f(h, z) = |h - z|$. As was shown
in (Shalev-Shwartz et al., 2009) for the batch setting, an ERM for this problem is universally consistent and
universally all-i-LOO stable, because removing a data point z from the dataset can change the hypothesis
only if there's an equal number of 0's and 1's (plus or minus one), which occurs with probability $O(\frac{1}{\sqrt{m}})$.
Shalev-Shwartz et al. (2009) also showed that the only uniform LOO stable algorithms on this problem must
be constant (i.e. always return the same hypothesis h, regardless of the dataset), at least for large enough
dataset, and hence cannot be an AERM.

It is also easy to see that this problem is not online learnable with any deterministic algorithm A. Consider
an adversary who has knowledge of A and picks the data points $z_i = 1 - A(S_{i-1})$. Then algorithm A incurs
loss $\sum_{i=1}^m f(A(S_{i-1}), z_i) = m$, while there exists a hypothesis h that achieves $\sum_{i=1}^m f(h, z_i) \leq \frac{m}{2}$. Hence
for any deterministic algorithm A, there exists a sequence of data points such that $\frac{R_m(A)}{m} \geq \frac{1}{2}$ for all m.
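The adversary argument above is easy to simulate. The sketch below assumes the learner is any deterministic function from the history of labels to a hypothesis in {0, 1}; the majority-vote ERM (ties broken towards 0) and the helper names are illustrative choices.

```python
def erm_majority(history):
    """ERM for f(h, z) = |h - z| on {0, 1}: predict the majority label (ties -> 0)."""
    return 1 if 2 * sum(history) > len(history) else 0

def play_adversary(learner, m):
    """Adversary that knows the learner and always plays z_i = 1 - A(S_{i-1})."""
    history, total_loss = [], 0
    for _ in range(m):
        h = learner(history)
        z = 1 - h
        total_loss += abs(h - z)  # always 1 by construction
        history.append(z)
    # Loss of the best fixed hypothesis in hindsight.
    best_fixed = min(sum(abs(h - z) for z in history) for h in (0, 1))
    return total_loss, best_fixed

if __name__ == "__main__":
    m = 101
    loss, best = play_adversary(erm_majority, m)
    print("learner loss:", loss, "best fixed:", best, "average regret:", (loss - best) / m)
```

The same lower bound of 1/2 on the average regret holds for any other deterministic `learner` passed to `play_adversary`, since the adversary's choice depends only on the learner's prediction.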
Now consider allowing randomized algorithms, in that we choose a distribution over {0, 1}. Allowing
randomized algorithms makes the problem linear (and hence convex) in the distribution (by linearity of expectation)
and makes the hypothesis space (the space of distributions on $\mathcal{H}$) convex. Let p denote the probability
of hypothesis 1. Then the problem can now be expressed with a hypothesis space $p \in [0, 1]$ and the loss
$f(p, z) = (1 - p)z + p(1 - z)$.

This problem is obviously online learnable with a randomized uniform-LOO stable RERM (i.e. Hedge)
that is uniform-LOO stable at rate $O(\frac{1}{\sqrt{m}})$ and no-regret at rate $O(\frac{1}{\sqrt{m}})$ using our previous results.

Even under this change, the previous ERM algorithm that is universally all-i-LOO stable would still
choose the same hypothesis as before, i.e. p would always be 0 or 1 and would not be uniform-LOO stable.
That would also be the case even if we make it pick $p = \frac{1}{2}$ or some other intermediate value when there is an
equal number of 0's and 1's. If we make it pick such an intermediate value it would still be universal all-i-LOO
stable as the hypothesis would still only change with small probability $O(\frac{1}{\sqrt{m}})$. However such an algorithm
cannot achieve no regret. Again if we pick the sequence $z_i = \text{round}(1 - A(S_{i-1}))$, then whenever i is even,
the ERM uses an odd number of data points and it must pick either 0 or 1 and would incur loss of 1. When i is
odd, there will be an equal number of 0's and 1's in the dataset (since at even steps A picks the ERM on an odd
number of points and the adversary then plays the opposite label) and no matter what p it picks it would incur
loss of at least $\frac{1}{2}$. Thus $\frac{R_m(A)}{m} \geq \frac{1}{4}$ for all m.

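We can also consider the following randomized RERM algorithm that uses only a convex regularizer:
$A(S) = \arg\min_{p \in [0, 1]} \sum_{i=1}^m f(p, z_i) + \sum_{i=0}^m \lambda_i |p - \frac{1}{2}|$. Let $\bar{z} = \frac{1}{m} \sum_{i=1}^m z_i$ and $\bar{\lambda} = \frac{1}{m} \sum_{i=0}^m \lambda_i$. Using
the subgradient of this objective, we can easily show that A(S) picks $\frac{1}{2}$ if $\bar{z} \in [\frac{1 - \bar{\lambda}}{2}, \frac{1 + \bar{\lambda}}{2}]$, and otherwise
picks either $p = 1$ if $\bar{z} > \frac{1 + \bar{\lambda}}{2}$ and $p = 0$ if $\bar{z} < \frac{1 - \bar{\lambda}}{2}$. This algorithm is not uniform-LOO stable, as for any
regularizer $\lambda_m$ and large enough m we can pick a dataset $S_m$ such that $S_{m-1}$ has $\bar{z} \in [\frac{1 - \bar{\lambda}}{2}, \frac{1 + \bar{\lambda}}{2}]$ but $S_m$ has
$\bar{z}$ outside this interval, so that $f(A(S_m), z_m) = 0$ but $f(A(S_{m-1}), z_m) = \frac{1}{2}$. Hence $\epsilon_{\text{loo-stable}}(m) \geq \frac{1}{2}$. However
it is universal all-i-LOO stable as the hypothesis would still only change with small probability $O(\frac{1}{\sqrt{m}})$ as in
the previous case (we need to draw m samples whose number of 1's is $\frac{m(1 - \bar{\lambda})}{2}$ or $\frac{m(1 + \bar{\lambda})}{2}$, plus or minus one, for the
hypothesis to change upon removal of a sample).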

Furthermore this algorithm doesn't achieve no regret. Consider the sequence where whenever $A(S_{i-1})$
picks $\frac{1}{2}$ we pick $z_i = 1$ and whenever $A(S_{i-1})$ picks 1 we pick $z_i = 0$. It is easy to see that by the way
this sequence is generated that the proportion of 1's $\bar{z}$ in $S_m$ will seek to track the boundary $\frac{1 + \bar{\lambda}}{2}$, where
the algorithm switches between $p = \frac{1}{2}$ and $p = 1$, as m increases. Since $\bar{\lambda} \to 0$ as $m \to \infty$, then in the
limit $\bar{z} \to \frac{1}{2}$. Since the sequence is such that every time we generate a 0, the algorithm incurs loss of 1 and
every time we generate a 1 it incurs loss of $\frac{1}{2}$, then its average loss converges to $\frac{3}{4}$ but there's a hypothesis that
achieves average loss of $\frac{1}{2}$ so the average regret converges to $\frac{1}{4}$. ■
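The last construction in the proof can also be simulated. This is a minimal sketch, assuming the regularization schedule $\lambda_i = 1/\sqrt{\max(1, i)}$ (any o(1) schedule could be substituted) and the closed form of A(S) obtained from the subgradient condition: pick 1/2 when $|m - 2k| \leq \sum_{i=0}^m \lambda_i$ for k the number of 1's, and otherwise the majority label. The function names are hypothetical.

```python
import math

def regularized_rerm(ones, m, lam_sum):
    """Minimizer of sum_i f(p, z_i) + lam_sum * |p - 1/2| over p in [0, 1].

    `ones` is the number of 1's among m points; lam_sum = lambda_0 + ... + lambda_m.
    """
    if abs(m - 2 * ones) <= lam_sum:
        return 0.5
    return 1.0 if 2 * ones - m > lam_sum else 0.0

def expected_loss(p, z):
    return (1.0 - p) * z + p * (1.0 - z)

def run_tracking_adversary(rounds):
    """Adversary plays z = 1 when the learner picks 1/2 and z = 0 when it picks 1."""
    ones, lam_sum, total_loss = 0, 1.0, 0.0  # lam_sum starts at lambda_0 = 1
    for m in range(rounds):
        p = regularized_rerm(ones, m, lam_sum)
        z = 0 if p == 1.0 else 1  # p = 0 is not expected to occur; z = 1 then maximizes the loss
        total_loss += expected_loss(p, z)
        ones += z
        lam_sum += 1.0 / math.sqrt(m + 1)  # add lambda_{m+1} (assumed schedule)
    best_fixed = min(ones, rounds - ones)  # the loss is linear in p, so an endpoint is optimal
    return (total_loss - best_fixed) / rounds, ones / rounds

if __name__ == "__main__":
    for rounds in (100, 2000, 20000):
        avg_regret, z_bar = run_tracking_adversary(rounds)
        print(rounds, "average regret:", round(avg_regret, 4), "proportion of 1's:", round(z_bar, 4))
```

Under these assumptions the proportion of 1's stays just above 1/2 and the average regret stays at or above 1/4, approaching 1/4 slowly as the number of rounds grows.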

This problem is insightful in a number of ways. First it shows that there are problems that are batch 
learnable that are not online learnable, but when considering randomized algorithms can become online 
learnable. Additionally it shows that a RERM that is universal all-i-LOO stable, the next weakest stability 
notion, cannot be sufficient to guarantee the algorithm achieves no regret. This shows we cannot guarantee 
no regret for any RERM using only universal all-i-LOO stability or any weaker notion of LOO stability. 
This also suggests that it might be necessary to have a notion of LOO stability that is at least stronger than 
all-i-LOO stability to guarantee no regret. 

Another point reinforcing the fact that uniform-LOO stability might be necessary is that it is known 
that online learnability is not equivalent to batch learnability (as shown in the example below). Therefore, 
necessary stability conditions for online learnability should intuitively be stronger than for batch learnability. 

Example 6.2 (Example taken from Adam Kalai and Sham Kakade) There exists a problem that is learnable
in the batch setting but not learnable in the online setting by any deterministic or randomized online
algorithm.

Proof: Consider a threshold learning problem on the interval [0, 1], where the true hypothesis $h^*$ is such that
for some $x^* \in [0, 1]$, $h^*(x) = 2I(x > x^*) - 1$. Given an observation $z = (x, h^*(x))$, we define the loss
incurred by a hypothesis $h \in \mathcal{H}$ as $L(h, (x, h^*(x))) = \frac{1 - h(x) h^*(x)}{2}$, for $\mathcal{H} = \{x \mapsto 2I(x > t) - 1 \mid t \in [0, 1]\}$ the
set of all threshold functions on [0, 1]. Since this is a binary classification problem and the VC dimension of
threshold functions is finite (1), we conclude this problem is batch learnable. In fact by existing results,
it is batch learnable by an ERM that is all-i-LOO stable. However in the online setting consider an adversary
who picks the sequence of inputs by doing the following binary search: $x_1 = \frac{1}{2}$ and $x_i = x_{i-1} - y_{i-1} 2^{-i}$,
and $y_i = -h_i(x_i)$, so that the observation by the learner at iteration i is $z_i = (x_i, y_i)$. This sequence is
constructed so that the learner always incurs loss of 1 at each iteration, and after any number of iterations m,
the hypothesis $h = 2I(x > x_{m+1}) - 1$ achieves loss 0 on the entire sequence $z_1, z_2, \dots, z_m$. This implies
the average regret of the algorithm is 1 for all m. Additionally even if we allow randomized algorithms such
that the prediction at iteration i by the learner is effectively a distribution over {-1, 1} where $p_i$ denotes the
probability $P(h_i(x_i) = 1)$ for the distribution over hypotheses chosen by the learner, then the expected loss
of the learner at iteration i is $\frac{1 + y_i - 2 p_i y_i}{2}$. Suppose the $x_i$ are chosen as before but $y_i = 1 - 2I(p_i \geq 0.5)$, i.e. $y_i = -1$
if $p_i \geq 0.5$ and $y_i = 1$ otherwise. Then again at
each iteration the learner must incur expected loss of at least $\frac{1}{2}$ but the hypothesis $h = 2I(x > x_{m+1}) - 1$
achieves loss of 0 on the entire sequence $z_1, z_2, \dots, z_m$. Hence the expected average regret is $\geq \frac{1}{2}$ for all m,
so that with probability 1, the average regret of the randomized algorithm is $\geq \frac{1}{2}$ in the limit as m goes to
infinity. Hence we conclude that this problem is not online learnable. ■
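The binary-search adversary can be sketched directly, using exact rational arithmetic so that the dyadic query points are represented without rounding. The nearest-label learner below is only an illustrative stand-in for an arbitrary deterministic learner, and the function names are hypothetical.

```python
from fractions import Fraction

def nearest_label_learner(xs, ys, x):
    """Illustrative deterministic learner: copy the label of the closest point seen so far."""
    if not xs:
        return 1
    j = min(range(len(xs)), key=lambda k: abs(xs[k] - x))
    return ys[j]

def binary_search_adversary(learner, m):
    """Play x_1 = 1/2, x_i = x_{i-1} - y_{i-1} * 2^{-i}, and label y_i = -(learner's prediction)."""
    xs, ys, mistakes = [], [], 0
    x = Fraction(1, 2)
    for i in range(1, m + 1):
        prediction = learner(xs, ys, x)      # in {-1, +1}
        y = -prediction                      # adversarial label
        mistakes += (1 - prediction * y) // 2  # loss (1 - h(x) y) / 2, equal to 1 here
        xs.append(x)
        ys.append(y)
        x = x - y * Fraction(1, 2 ** (i + 1))  # next query point x_{i+1}
    return xs, ys, x, mistakes

def threshold_hypothesis(t):
    """h(x) = 2 I(x > t) - 1."""
    return lambda x: 1 if x > t else -1

if __name__ == "__main__":
    m = 30
    xs, ys, x_next, mistakes = binary_search_adversary(nearest_label_learner, m)
    h = threshold_hypothesis(x_next)
    print("learner mistakes:", mistakes, "of", m)
    print("threshold at x_{m+1} consistent:", all(h(x) == y for x, y in zip(xs, ys)))
```

Whatever deterministic learner is plugged in, it errs on every round, while the threshold placed at $x_{m+1}$ labels all m points correctly.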

6.1 Necessary Stability Conditions for Online Learnability in Particular Settings 

6.1.1 Binary Classification Setting 

We now show that if we restrict our attention to the binary classification setting ($f(h, (x, y)) = I(h(x) \neq y)$
for $y \in \{0, 1\}$), online learnability is equivalent to the existence of a (potentially randomized) uniform LOO
stable RERM.

Our argument uses the notion of Littlestone dimension, which was shown to characterize online learnability
in the binary classification setting. Ben-David et al. (2009) have shown that a classification problem
is online learnable if and only if the class of hypotheses has finite Littlestone dimension.

By our current results we know that if there exists a uniform LOO stable RERM, the classification problem
must be online learnable and thus have finite Littlestone dimension. We here show that finite Littlestone
dimension implies the existence of a (potentially randomized) uniform LOO stable RERM. To establish this,
we use the fact that when $\mathcal{H}$ is infinite but has finite Littlestone dimension ($\text{Ldim}(\mathcal{H}) < \infty$), Weighted
Majority can be adapted to be a no-regret algorithm by playing distributions over a fixed finite set of experts
(of size $\leq m^{\text{Ldim}(\mathcal{H})}$ when playing for m rounds) derived from $\mathcal{H}$ (Ben-David et al., 2009):

Theorem 35 For any binary classification problem with hypothesis space $\mathcal{H}$ that has finite Littlestone dimension
$\text{Ldim}(\mathcal{H})$ and number of rounds m, there exists a Randomized uniform-LOO stable RERM algorithm.
In particular, it has no regret at rate $\epsilon_{\text{regret}}(t) \leq \sqrt{2 \log(m) \text{Ldim}(\mathcal{H})} \big[\frac{3}{\sqrt{t}} + \frac{\log(t)}{2t} + \frac{1 + 2\log(2)}{2t}\big]$ for all $t \leq m$.

Proof: The algorithm proceeds by constructing the same set of experts as in Ben-David et al. (2009) from
$\mathcal{H}$, which has number of experts $\leq m^{\text{Ldim}(\mathcal{H})}$ for m rounds. The previously mentioned Weighted Majority
algorithm on this set achieves no regret at rate $\epsilon_{\text{regret}}(t) \leq \sqrt{2 \log(m) \text{Ldim}(\mathcal{H})} \big[\frac{3}{\sqrt{t}} + \frac{\log(t)}{2t} + \frac{1 + 2\log(2)}{2t}\big]$ for
all $t \leq m$ (since the maximum instantaneous regret is 1) and is a Randomized uniform-LOO stable RERM
as shown in Theorem 33. ■

This result implies that finite Littlestone dimension is equivalent to the existence of a (potentially randomized)
uniform LOO stable RERM, and therefore that online learnability in the binary classification setting is
equivalent to the existence of a (potentially randomized) uniform LOO stable RERM:

Corollary 36 A binary classification problem is online learnable if and only if there exists a (potentially 
randomized) uniform-LOO stable RERM. 

6.1.2 Problems with Sub-Exponential Covering 

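For any $\epsilon > 0$, let $\mathcal{C}_\epsilon = \{C \subseteq \mathcal{H} \mid \forall h' \in \mathcal{H}, \exists h \in C \text{ s.t. } \forall z \in Z : |f(h, z) - f(h', z)| \leq \epsilon\}$. $\mathcal{C}_\epsilon$ is the set of all
subsets C of $\mathcal{H}$ such that for any $h' \in \mathcal{H}$, we can find an $h \in C$ that has loss within $\epsilon$ of the loss of $h'$ at all
$z \in Z$.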

We define the $\epsilon$-covering number of the tuple $(\mathcal{H}, Z, f)$ as $N(\mathcal{H}, Z, f, \epsilon) = \inf_{C \in \mathcal{C}_\epsilon} |C|$, i.e. the minimal
number of hypotheses needed to cover the loss of any hypothesis in $\mathcal{H}$ within $\epsilon$. We will show that we
can guarantee no-regret with a Randomized uniform-LOO stable RERM algorithm (e.g. Hedge) as long as
there exists a sequence $\{\epsilon_i\}_{i=0}^\infty$ that is o(1) and such that for any number of rounds m: $N(\mathcal{H}, Z, f, \epsilon_m)$ is
o(exp(m)).

Theorem 37 Any learning problem (with instantaneous regret bounded by B) where there exists a sequence
$\{\epsilon_m\}_{m=0}^\infty$ that is o(1) and such that $\{N(\mathcal{H}, Z, f, \epsilon_m)\}_{m=0}^\infty$ is o(exp(m)), is online learnable with a Randomized
uniform-LOO stable RERM algorithm. In particular, when playing for m rounds it has no regret at
rate $\epsilon_{\text{regret}}(t) \leq B \sqrt{2 \log(N(\mathcal{H}, Z, f, \epsilon_m))} \big[\frac{3}{\sqrt{t}} + \frac{\log(t)}{2t} + \frac{1 + 2\log(2)}{2t}\big] + \epsilon_m$ for all $t \leq m$.

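Proof: Suppose we know we must do online learning for m rounds. Then we can construct an $\epsilon_m$-cover
C of $(\mathcal{H}, Z, f)$ such that $C \subseteq \mathcal{H}$ and $|C| = N(\mathcal{H}, Z, f, \epsilon_m)$. From Theorem 33, we know
that running Hedge on the set C guarantees that $\frac{1}{t} \sum_{i=1}^t E_{h_i \sim P_{\theta_i}}[f(h_i, z_i)] - \inf_{h \in C} \frac{1}{t} \sum_{i=1}^t f(h, z_i) \leq B \sqrt{2 \log(N(\mathcal{H}, Z, f, \epsilon_m))} \big[\frac{3}{\sqrt{t}} + \frac{\log(t)}{2t} + \frac{1 + 2\log(2)}{2t}\big]$ for all $t \leq m$. By definition of C, $\inf_{h \in C} \frac{1}{t} \sum_{i=1}^t f(h, z_i) \leq \inf_{h \in \mathcal{H}} \frac{1}{t} \sum_{i=1}^t f(h, z_i) + \epsilon_m$ for all $t \leq m$. So we conclude $\epsilon_{\text{regret}}(t) \leq B \sqrt{2 \log(N(\mathcal{H}, Z, f, \epsilon_m))} \big[\frac{3}{\sqrt{t}} + \frac{\log(t)}{2t} + \frac{1 + 2\log(2)}{2t}\big] + \epsilon_m$ for all $t \leq m$. ■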

This theorem applies to a large number of settings. For instance, suppose we have a problem where $f(\cdot, z)$ is
K-Lipschitz continuous at all $z \in Z$ with respect to some norm $\|\cdot\|$ on $\mathcal{H}$, and $\mathcal{H} \subseteq \mathbb{R}^d$ for some finite d and
has bounded diameter D under $\|\cdot\|$ (i.e. $\sup_{h, h' \in \mathcal{H}} \|h - h'\| \leq D$). Then $N(\mathcal{H}, Z, f, \epsilon)$ is $O((\frac{KD}{\epsilon})^d)$ for
all $\epsilon > 0$. Choosing $\epsilon_m = \frac{1}{m}$ implies we can achieve no regret at rate $\epsilon_{\text{regret}}(t) \leq O\big(B \sqrt{\frac{d \log(KDm)}{t}}\big)$
for all $t \leq m$. This notion also allows us to handle highly discontinuous loss functions. For instance consider
the case where $Z = \mathcal{H} = \mathbb{R}$ and the loss $f(h, z) = 1 - I(h \in \mathbb{Q}) I(z \in \mathbb{Q}) - I(h \notin \mathbb{Q}) I(z \notin \mathbb{Q})$, i.e. the loss
is 0 if both h and z are rational, or both irrational, and the loss is 1 if one is rational and the other irrational.
In this case, the set $C = \{1, \sqrt{2}\}$ is an $\epsilon$-cover of $(\mathcal{H}, Z, f)$ for any $\epsilon \geq 0$ and thus we can achieve no-regret
at rate $O(\frac{1}{\sqrt{m}})$ by running Hedge on the set C.
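To make the covering construction concrete, the sketch below builds an $\epsilon$-cover for a one-dimensional Lipschitz problem. It assumes $\mathcal{H} = Z = [0, D]$ and the loss $f(h, z) = K|h - z|$, which is K-Lipschitz in h; the grid construction, the sampled verification and all names are illustrative and not taken from the paper.

```python
import math

def grid_cover(D, K, eps):
    """Centers spaced 2*eps/K apart, so every h' in [0, D] is within eps/K of some center."""
    n = max(1, math.ceil(K * D / (2.0 * eps)))
    return [min(D, (2 * k + 1) * eps / K) for k in range(n)]

def loss(h, z, K):
    """Assumed K-Lipschitz loss f(h, z) = K * |h - z|."""
    return K * abs(h - z)

def worst_cover_gap(cover, D, K, samples=200):
    """Sampled check of the cover property: max over h' of min over h in cover of
    max over z of |f(h, z) - f(h', z)|, with h' and z taken on a uniform grid."""
    grid = [D * s / samples for s in range(samples + 1)]
    worst = 0.0
    for h_prime in grid:
        gap = min(max(abs(loss(h, z, K) - loss(h_prime, z, K)) for z in grid) for h in cover)
        worst = max(worst, gap)
    return worst

if __name__ == "__main__":
    D, K, eps = 1.0, 2.0, 0.125
    cover = grid_cover(D, K, eps)
    print("cover size:", len(cover), "worst sampled gap:", worst_cover_gap(cover, D, K))
    # With eps_m = 1/m the cover size grows only linearly in m here, hence sub-exponentially.
    for m in (10, 100, 1000):
        print(m, len(grid_cover(D, K, 1.0 / m)))
```

Running Hedge on such a finite cover then costs at most an additional $\epsilon$ in average regret relative to the best hypothesis in the full class, as in the proof of Theorem 37.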

7 Conclusions and Open Questions 

In this paper we have shown that popular online algorithms such as FTL, FTRL, Mirror Descent, gradient-based
methods and randomized algorithms like Weighted Majority and Hedge can all be analyzed purely in
terms of stability properties of the underlying batch learning algorithm that picks the sequence of hypotheses
(or distribution over hypotheses). In particular, we have introduced the notion of online stability, which is
sufficient to guarantee online learnability in the general learning setting for the class of RERM and RSLM
algorithms. Our results allow us to relate a number of learnability results derived for the batch setting to the online
setting. There are a number of interesting open questions related to our work. First, it is still an open question
whether, for the general class of always AERM (at o(1) rate), it is sufficient to be online stable (at o(1)
rate) to guarantee no regret, or whether a counter-example proves otherwise. The presented examples seem
to suggest that a problem is online learnable only if there exists a uniform-LOO stable or online stable (and
always AERM) algorithm, or at least one with some form of LOO stability in between online stable and all-i-LOO
stable. This has been verified in the binary classification setting where we have shown that online learnability
is equivalent to the existence of a potentially randomized uniform-LOO stable RERM. While we haven't been
able to provide necessary conditions for online learnability in the general learning setting, we have shown that
all problems with a sub-exponential covering are online learnable with a potentially randomized uniform-LOO
stable RERM. An interesting open question is whether the notion of sub-exponential covering we have
introduced turns out to be equivalent to online learnability in the general learning setting. If this is the case,
this would establish immediately that existence of a (potentially randomized) uniform-LOO stable RERM is
both sufficient and necessary for online learnability in the general learning setting.